I

THE HISTORY, TYPES, AND PROCEDURES OF FREQUENCY COUNTS 

SECTION I - NARRATIVE 



On a world scale, the counting of the frequency of occurrence of
linguistic elements has had a long history motivated by diverse purposes.
Traditionally the counts have been of words, but they have also been of
phonemes, morphemes, syllables, or idiomatic expressions. The purpose of
these counts has usually been to develop a vocabulary of a special type,
such as of rare, frequent, useful, or important words, with the ultimate
objective of developing vocabularies for the teaching and learning of
stenography, spelling, or reading in the easiest and most efficient manner
possible.

The counts with which we are familiar extend far back into the
Mediterranean world, where, in one instance, we find that the scholars of
Alexandria (Egypt) distinguished between rare and frequent words of Homeric
Greek for the benefit of local students of literary Greek. In the Tenth
Century, the Talmudists categorized and counted the words in the Torah.

In general, the West European history of frequency counts proceeds
from a similar implicit assumption that the best way to learn a language,
either your own or someone else's, is to know which words you will meet
most often so that you can learn them first. Earlier lists tended to be
limited beginner's vocabularies, usually of "useful" or "hard" words. Such
lists were compiled as early as the Fifteenth Century.

Other early lists tended to be for occupational or instructional
purposes, providing mainly name lists such as of birds, animals, parts of
the body, and occupations.

These early lists tended to be restricted, special-purpose
vocabularies, and it was not until 1721 that Nathaniel Bailey attempted to
compile an extensive English vocabulary to serve as the basis for a
dictionary.

In 1588, Timothy Bright published his "Characterie: An Arte of
Shorte, Swifte, and Secrete Writing by Character". It was the first known
attempt at developing a form of shorthand, although it was not phonetically
based. It was also the first attempt at developing a self-contained, basic
or "island" vocabulary capable of being used to express all necessary
concepts with as few words (or symbols) and their variations as possible.

In 1649, Sulanus published a list of words appearing in Homeric
Greek, and a little later a cleric named Winckler, who lived in Hamburg,
annotated a Greek-German version of the New Testament to indicate words
occurring only once or only in a single verse. A similar but more elaborate
Dutch-Greek version of that annotation, published in 1698 over the signature
of Johannes Leusden, indicated that there were 4,956 different words in the
Dutch version of the New Testament, of which 1,686 occurred only once or
only in one verse. He marked all those that occurred only once, all those
that occurred twice or more, and indicated the number of new words occurring
in each chapter.

During the Sixteenth and Seventeenth Centuries various counts of the
vocabulary in literature were published in England, including one of the
Authorized Version of the New Testament, which in English was determined
to contain 6,000 different words. However, we do not know the principles
on which the counts were made, i.e., whether only headwords were counted,
or whether derivatives and inflections were also counted. The same
uncertainty applies to the counts of this era of Shakespeare's plays (21,000
words) and Milton's (8,000 different words in Paradise Lost).

Nineteenth Century Counts

In the last half of the Nineteenth Century, two frequency counts of
different types appeared which anticipated the proliferation of counts
which has taken place in this century. The first was W. D. Whitney's
phonetic study on the relative frequency of 10,000 sounds found in 10
classics of English literature, as drawn from 1,000-sound samples from each
of five prose and five poetic works. The word count type of study of this
era is best exemplified by F. W. Kaeding's count.




Kaeding's work was the first and one of the largest of the modern
frequency counts. With the assistance of nearly 6,000 contributors and
workers, Kaeding collected and counted some 11 million German words
and 20 million syllables from 14 categories of material. He found 250,173
different words, of which half occurred only once. The purpose of the
count was to assist in teaching stenography rather than language, so homonyms
were listed only once regardless of word meaning, although derivatives
and inflections were listed separately. In German, this meant some physical
separation in the alphabetical summary lists because of the phonetic spelling
of umlauted letters; plurals were separated from their singulars and
verb forms scattered, making the determination of the total frequency
of a semantic concept a task involving much hunting for related word forms
and adding of frequencies. (Morgan subsequently corrected many of these
shortcomings in his revision published in 1928.) Nevertheless, this work
is important not only for its size and detail, but because it firmly
established the method of counting large numbers of words from a wide variety
of sources in order to find truly general or representative words, and
established frequency of occurrence of a word as the basis for a determination
of its linguistic importance or value. This use of frequency is more
valid for shorthand word lists than for other purposes, and has since
been modified by later researchers preparing vocabularies for other purposes.

Before the close of the Nineteenth Century, there appeared another
word count, this time in English. It was J. M. Rice's "Rational Spelling
Book". It is important mainly because it reveals another reason for
frequency counting: the teaching of spelling of the real word, as opposed to
some representation of it as in shorthand. The frequency list was designed
to determine which words were used most often and, therefore, those which
should be learned first. In spelling as well as stenography, in which the
word is more important than its meaning, frequency can be expected to
adequately index the importance of a word.



1904-1920



In 1904, Reverend J. Knowles, in England, while developing his
London Point System of Reading for the Blind, made a frequency count of
literature, principally from the Bible, for a total of 100,000 running
words. From this corpus he derived a list of the 350 most frequently
occurring words and indicated the frequency of each. This count is of
interest because it was among the first to note that the first noun appeared
as number 73 in order of frequency rank. The preceding 72 were the so-called
structural, semi-structural, grammatical, or relating words. This
phenomenon of the late appearance of nouns, resulting from the use of raw
frequency data to determine the importance of a word for purposes of
vocabulary compilation, is, in fact, inherent in frequency counts and has led
later word counters to use various procedures to compensate for the
disproportionate dominance of the grammatical functors. Keniston and
Thorndike were among the first to make such modifications.

In 1910, W. E. Chancellor wrote an article entitled "Spelling:
1,000 Words". He had derived his 1,000 words from some 40,000 that he
compiled and then screened. His work is important not because of the
techniques which he used, which were neither particularly well documented
nor sophisticated, but because he employed personal correspondence rather
than literature as his source material. Later investigators using better
techniques have made excellent use of such sources.

In 1911, R. W. Eldridge made a study of newspaper language in the
Buffalo, New York area and published the results of his study as "Six
Thousand Common English Words." He made the study for the purpose of
developing a limited universal vocabulary. His sources did not lead to much
universality, but his work is cited because it 1) involved counting only
44,000 running words (which really are not enough to be statistically sound,
much less universal) and 2) involved only one broad category of material,
i.e., newspaper English from four newspapers in one localized area. However,
the count has been used as late as 1967 for purposes of comparison by Beier,
Starkweather, and Miller in their "Analysis of Word Frequencies in Spoken
Language of Children" (1967), a study they made of the oral vocabulary of
grade school children in Salt Lake City.

In the 1913-15 period, Leonard Ayres compiled and published two
works: "The Spelling Vocabularies of Personal and Business Letters" and
"Measuring Scale for Ability in Spelling". He assessed some 400,000
running words from 2,500 people in 75 communities. About 70 percent of the
material was from personal and business letters. The data collected from
these sources were used to devise his "Measuring Scale for Ability in
Spelling". This may have been the first use of word counts for testing of
students. Ayres' work clearly demonstrated that while languages have many
words, the frequencies of usage of these words are heavily skewed, a
small absolute number of word types accounting for the majority of the
words used in ordinary discourse. For example, he found that his 50
commonest words made up 50 percent of the materials he counted; 300
words (in order of frequency) made up 75 percent, and 1,000 made up 90
percent. Later studies, in confirming this stable finding, stimulated
interest in basic and limited (but universal) vocabularies.
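
The kind of coverage figure Ayres reported (the share of all running words
accounted for by the N commonest word types) can be illustrated with a
short sketch. The function name and the toy sentence below are assumptions
made for illustration only; they are not Ayres' procedure or data.

```python
from collections import Counter


def coverage_of_top_types(tokens, n):
    """Return the share of running words covered by the n most frequent types."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # an empty corpus has no coverage to report
    top = counts.most_common(n)
    return sum(freq for _, freq in top) / total


# Toy sample, invented for illustration (not drawn from any count cited here).
sample = "the cat and the dog and the bird saw the cat".split()
print(coverage_of_top_types(sample, 2))
```

On a real corpus, calling the function with n = 50, 300, and 1,000 would
give figures of the same kind as the percentages quoted above.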

Cook and O'Shea in 1914 made their "Concrete Investigation of the
Material of English Spelling" based on the family letters of 13 informants.
Cook and O'Shea found 5,200 different words in the 200,000 they counted.
Allowing for the small corpus and limited range of the samples, the results
as far as word usage is concerned replicated Ayres' findings on the heavy
use of a few words, in particular the function words. They found that the
first nine words on their frequency list constituted 25 percent of all words
in the corpus, and the first 42 more than 50 percent, all of which were
functors. Of the remainder, 963 words included 91 percent of all running
words even though another 4,237 were used one or more times for the remaining
3 percent of word usage.



In 1920, Hayward Keniston published one of the earliest of the
foreign language counts prepared for the purpose of second language
instruction. It was called "Common Words in Spanish" and is particularly
noteworthy because Keniston was apparently the first vocabulary compiler to
recognize that word value or importance does not depend on word occurrence
frequency alone. It must, therefore, depend on some other factor or factors
related to the uses to which the list is to be put. Keniston apparently
believed that if representative or general-use words are desired, the number
of different sources in which a word appears might be as important as how
often the word appears. If the occurrence of a word was restricted to
relatively few sources or types of sources, Keniston argued, such
restriction indicated that the word was peculiar either to the author or
to the subject matter and, hence, had lower value even though its frequency
in such sources might be high. In recognition of this fact, Keniston noted
the effect of range (the number of sources in which a word appeared) by
using two lists. The lists were based on frequency of occurrence, but in
one he placed only those words which qualified by appearing in 80 percent
of his sources, and in the other he listed only those which appeared in at
least 66 percent of his sources. These sources, incidentally, were mainly
plays, but included newspapers, reviews, short stories, and novels (all
printed sources, except insofar as pseudo-oral material might be included,
particularly in plays, speeches, or quotations).
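
The idea of qualifying words by range can be sketched as follows. This is
a minimal illustration, assuming each source is simply a list of word
tokens; the function name, the toy sources, and the tie-breaking order are
hypothetical and are not taken from Keniston's study.

```python
from collections import Counter


def words_meeting_range(sources, min_share):
    """List (word, frequency, range) for words found in at least min_share of sources."""
    frequency = Counter()      # total occurrences across all sources
    source_range = Counter()   # number of distinct sources containing the word
    for tokens in sources.values():
        frequency.update(tokens)
        source_range.update(set(tokens))
    n_sources = len(sources)
    qualified = [
        (word, frequency[word], source_range[word])
        for word in frequency
        if source_range[word] / n_sources >= min_share
    ]
    # Most frequent first; alphabetical order breaks ties (an arbitrary choice).
    return sorted(qualified, key=lambda item: (-item[1], item[0]))


# Toy sources, invented for illustration only.
toy_sources = {
    "play": ["casa", "torero", "torero", "torero", "torero"],
    "novel": ["casa", "rio"],
    "newspaper": ["casa", "rio", "casa"],
}
print(words_meeting_range(toy_sources, 0.66))
print(words_meeting_range(toy_sources, 0.80))
```

In the toy data, "torero" is as frequent as "casa" but occurs in only one
source, so it fails both thresholds, which is the effect the range
criterion is meant to have.
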

The Early Modern Counts, 1920-1930

In 1921, Edward Thorndike's seminal counts of English first appeared
and heralded an era of his authority which has lasted until today. As
Kaeding had done for German, Thorndike for the first time carefully
described the word frequencies of a massive corpus of the written language.


In spite of criticism by Dewey, Palmer, Bongers, Rosenzweig, and
McNeil, among others, his works have survived. Thorndike, an educational
psychologist, was probably the greatest exponent of objective frequency
counts as the basis for teaching, principally in the areas of reading,
vocabulary building, and spelling, although his Teacher's Word Lists have
also been used for preparing graded textbooks and achievement tests. His
first "Teachers' Wordbook of 10,000 Words", which appeared in 1921, was
successively revised in 1931 to 20,000 words and in 1944 (in collaboration
with Irving Lorge) to 30,000 words. His initial list of 10,000 words was
compiled from 41 sources and a total of four million running words. He
enlarged it to nearly ten million running words from 241 sources and 20,000
selected words in 1931. Between 1931 and 1944, he and Irving Lorge enlarged
the corpus by additional studies of their own and the incorporation of prior
studies by others. The resulting corpus was on the order of 23,500,000
running words, from which they produced their "Teachers' Wordbook of 30,000
Words" in 1944.



One interesting feature is that, like Keniston, Thorndike modified
his evaluation of word importance as indicated by frequency by integrating
it with its range of occurrence, i.e., the number of sources in which it
appeared, designating the index the "merit number" of a word. The value
of the so-called "merit number" was defined by the calculation
MN = f/10 + r, f being the word frequency and r the number of sources in
which the word appeared.
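
As a worked illustration of the calculation MN = f/10 + r, the following
sketch computes merit numbers for two invented words. Only the formula
comes from the text; the function names and the frequencies and ranges are
hypothetical.

```python
def merit_number(frequency, source_range):
    """Merit number as given in the text: MN = f/10 + r."""
    return frequency / 10 + source_range


def rank_by_merit(stats):
    """Sort words by descending merit number; stats maps word -> (f, r)."""
    scored = [(word, merit_number(f, r)) for word, (f, r) in stats.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Invented figures: one word frequent in few sources, one spread across many.
toy_stats = {"specialized": (500, 3), "general": (200, 40)}
print(rank_by_merit(toy_stats))
```

With these figures the widely distributed word scores 60 against 53 for
the more frequent but narrowly distributed one, showing how the range term
can outweigh raw frequency.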

The range of the 30,000 list is imposing, involving 285 separately
listed sources, including those supporting previous counts by other
researchers on reading, writing, and spelling vocabularies, the Lorge Magazine
Count of one million words, and a juvenile count of about 120 sources. One
problem with the 1944 publication, however, is that from the point of view
of research, it omits much of the background material on procedures which
appeared in the 1931 "20,000 Wordbook". It also omits a list of 135 words
common to all sources Thorndike used and groups the highest frequency words
under gross occurrence categories ("AA" and "A") without differentiation.

The material forming the basis of these counts is now old, and much
was old even in 1921, since about three of the first four million words in
the original corpus came from the Bible. In spite of these drawbacks, it
was much in demand and was reprinted as late as 1963. At the present time,
it has been superseded for printed English for children by the American
Heritage "Word Frequency Book" (1971). For the written language of adults,
it has been superseded by the "Computational Analysis of Present Day
American English" by Kučera and Francis (1967).

The revised basic Thorndike Teachers' 20,000 Wordbook of 1931 was
supplemented in 1938 by Thorndike and Lorge in "A Semantic Count of English
Words", which gave the frequency of occurrence of each meaning of each word
of Thorndike's 1931 Wordbook, based on a detailed analysis of some 2,350,000
words.

In 1949, Lorge published a revised and improved version of the 1938
Semantic List of the 570 most frequently occurring words.

These semantic lists helped correct a basic deficiency of those
objective word lists which did not separate the frequency of the several
meanings of the dictionary entry or the difference between the completely
different meanings of homonyms which are separate dictionary entries. For
that reason, the normal objective count, in which pronunciation and form,
not meaning, are important, has been less satisfactory from a reading
standpoint, although it may well serve its purpose for shorthand and
spelling.

In 1923, Godfrey Dewey, apparently dissatisfied with the lack of
diversity of materials included in English frequency counts made for the
purpose of teaching shorthand, published "The Relativ Frequency of
English Speech Sounds". This is the first such study of sounds recorded
since Whitney's in 1874. He corrected the problem of representative
selection of materials by sampling newspapers (editorials and articles), modern
novels and short stories, speeches, personal and business correspondence,
advertising, religion (Bible, sermons, journals, and papers), science, and
magazines. Dewey's error, while correcting lack of diversity, was in
collecting samples that were too small. His total corpus was only 100,000
running words, which is about 1/10 of what Bongers considers a minimum (1947,
pages 37 and 240). It is noteworthy that oral English was included in this
count in the form of speech materials.



In 1924, V. A. C. Henmon of the University of Wisconsin published
his oft-cited "A French Wordbook Based on 400,000 Running Words". More
than 60 people contributed to the sampling and collection process, obtaining
400,000 running words from nine different categories of printed and written
sources. These 400,000 word tokens were found to represent 9,187 different
word types or orthographic variants, 1,250 of these occurring 25 times or
oftener. Subsequently, Henmon published a separate listing of the 3,905
words which occurred five or more times in his count. His study originated
in an attempt to find the influence of Latin on French, but developed
to serve much broader educational purposes, particularly vocabulary
selection.

1. Dewey's tendentious attempts at English spelling reform have done little
more than cause secretaries and typesetters to misspell the title of his
article.




In 1926, Ernest Horn, as a result of several years of study of his
own and the use of the research of others in the field of analysis of
personal and business correspondence, published his well-known "Basic
Writing Vocabulary". It was based on a total of 5,136,160 running words,
which provided 36,373 different words after omitting proper nouns. From
these, Horn finally selected 10,000, although he found that the reliability
of his count decreased rapidly after the first 1,000, with occurrence
frequencies less than 77. He considered both frequency and range by use of a
complicated formula to ensure due consideration of range in the final
selection of words for the list. He deliberately left out all words with
fewer than four letters since his interest was spelling and he felt that
words of three letters or less are not hard to spell. He also left out 41
common words of the type that would probably appear in anyone's list of
the first 100 most common English words. They were mainly short adjectives,
adverbs, and pronouns. Omission of words with fewer than four letters has
caused problems in trying to apply the list for general vocabulary purposes.
By the time of Horn's work, the omission of the most common words,
principally functors, was a common practice motivated by an attempt to
capture a greater number of substantive words ordinarily displaced by the
ubiquitous functors.

Early Modern Foreign Counts 



In 1927, Milton Buchanan, working under the auspices of the American
and Canadian Committees on Modern Languages, produced a "Graded Spanish
Wordbook". Its principal purpose was to provide material for graded
vocabulary tests. Buchanan amassed 1,200,000 running words from a total of 40
sources spread over seven categories of printed material. He believed the
range was adequate for classroom purposes, including oral practice in
conversation, even though no actual conversations were included in the corpus.
The dialogues drawn from plays were supposed to furnish the oral element.
With respect to word importance, Buchanan used the Thorndike-Henmon method
of combining range and frequency in order to obtain a merit number. Buchanan
found 18,331 different words in his corpus, but reduced them to a list of
some 5,300 occurring ten times or more, with a range of up to 40 sources.
Following fast-developing practice but extending it even further, Buchanan
also eliminated the 189 most common words from his general vocabulary and
published them separately. These deleted items consisted of articles,
conjunctions, numerals, pronouns, proper and geographic names, adjectives,
adverbs, prepositions, and some very common nouns and verbs. Buchanan was
cognizant of prior Spanish word counts such as that of Keniston, but there is
no indication that he incorporated them as Vander Beke did Henmon's in French.

In 1928, B. Q. Morgan revised the Kaeding Frequency Dictionary of
the German language, also under the auspices of the American and Canadian
Committees on Modern Foreign Languages. His purpose was to make the Kaeding
list useful from the standpoint of teaching foreign language, in addition
to stenography. To make the count more useful for general purposes, Morgan
used the concept of basic or stem words and grouped under them all words in
the Kaeding count which had a cognate or semantic similarity and a frequency
count of 200 or more. This grouping resulted in a list of 2,402 stem words,
which he arranged in blocks of descending frequency ranges. Morgan then
prepared an alphabetical list of 6,000 words in which he listed the basic
2,402 words together with any of their derivatives with a frequency of 100
or more.

Although the Morgan revision made the Kaeding count more usable, it
did not correct the sample problems. It was still a general, printed
vocabulary containing no oral sampling as such, was out of date (even in
1898), and contained no specialized words (such as those required in the
classroom) either in the main list or in any supplemental list.

In 1929, Vander Beke, under the sponsorship of the American and
Canadian Committees on Modern Foreign Languages, published another French
word count called the "French Wordbook", which incorporated (extended and
updated) Henmon's earlier work. Vander Beke's corpus amounted to 1,147,748
running words and 19,253 individual words. A cut-off at the range of five
reduced the list to 6,067 words. These he made into a list using range as
the main criterion rather than frequency. Vander Beke also listed the 69
most common words separately, as Part I of his study. The 69 all had a
frequency on the Henmon list of 450 or more and consisted principally of
structural entries.

Vander Beke set up his basic list of 6,067 in Part II in such a way
as to show the range and frequency of each word in his independent count,
the Henmon frequency, and a combined frequency of each word.



The combined corpus was over 1,500,000 words (Henmon's 400,000
and Vander Beke's own 1,147,748). Like Henmon's, Vander Beke's sources
were printed or written materials, some of which went back into the
mid-Nineteenth Century. In 1939, West and Bond reworked Vander Beke's list to
make it more convenient for the teaching of reading by grouping derivatives
under headwords and providing lists of Latin roots and French affixes to
assist in word recognition. Groups of related words were listed in
frequency groups of 100.

In 1929-1930, there appeared three supplements to single word lists
in Spanish (Keniston), German (Hauch), and French (Cheydleur). These
supplements consisted of lists to account for fixed collocations of words
which together conveyed a meaning different from the sum of the meanings of
the individual words.

In 1930, C. K. Ogden published the first edition of his "Basic
English". It is a subjective list based on essential semantic concepts rather
than the result of an objective frequency count. It is of interest, however,
because, like some objective counts, it contains a minimum essential or
island vocabulary. The number of words is stated as 850, but the actual
count may run as high as 2,000 depending on how variants of the basic 850
are counted. The purpose of Basic English was to produce an international
language. In the process, meanings and grammatical constructions of standard
English were changed so much that Basic English cannot be considered a
minimum vocabulary of standard English, although its words are often compared
with those resulting from objective counts of standard English. Basic
English was revised and expanded by E. C. Graham in 1968. Five years
earlier, Lancelot Hogben published Essential World English as his
replacement for Basic English as a universal language. Instead of using an
850-word list, Hogben (1963) recommended a 1,300-item list of what he calls
Essential Semantic Units. His list largely avoids synonyms, homonyms, and
dual meanings of any unit while embracing all necessary concepts.

Early Word Counts As Vocabularies

In addition to the American and Canadian Committees on Modern Foreign
Languages, one of the chief proponents of limited vocabularies of English
for teaching purposes was the Institute of Research in English Teaching
(IRET), sponsored by the Japanese Department of Education in Tokyo. Its
head was Dr. Harold E. Palmer. Beginning in 1931, Palmer and his associates
published English vocabularies of several sizes, among them lists of 600,
1,000, 2,000, and 3,000 words. As they were revised, the 1,000-word minimum
word lists became the most popular. These lists introduced the idea of
radius, which was almost like frequency grouping, in that each radius list
contained a predetermined number of most important words: 500, for example,
and the next radius might have 1,000 words, which would include the first
500 plus the next most important 500. These lists tended at first to be more
subjective than objective, but later became selective lists based on
considerations of objective frequency and range as well as subjective and
empirical considerations.

In 1934-1935, Dr. Michael West, under the sponsorship of the Carnegie
Corporation, convened conferences to coordinate the efforts of the objective
word counters such as Thorndike, the IRET group from Tokyo, and the
teachers of English as a foreign language. As a result of two major
conferences, Dr. West and his associates published the Committee report as the
"Interim Report on Vocabulary Selection" in 1936. It included a list of
2,000 General Service Words to be used as a basic vocabulary of English
for foreign language students. Dr. West, assisted by Dr. Lorge, revised
it into a semantic frequency list based for the most part on five million
running words. The list was arranged by word frequency, but with the
frequency of the various meanings of each word indicated by the percentage of
the frequency value of the stem word contributed by each meaning. The list
contains a supplementary list of 425 popular scientific words to round out
the basic 2,000-word list. Dr. West published the revised list and its
supplement in 1953 as "A General Service List of English Words". In the
late 1930's, West also published several other minimum vocabularies of from
900 to 1,500 words, generally comparable to those of the first and second
thousands of the IRET 3,000-word list.

In 1937, Albert de la Court, who was teaching Dutch in Indonesia,
produced a word count in Dutch which included word combinations (idiom-like
expressions). Its purpose was to assist teachers and textbook writers of
Dutch in Indonesian schools. It was a count based on 370 printed sources
covering both adult and children's books, magazines, and newspapers in Java
and in the Netherlands. From one million running words, 3,296 separate
words and 2,000 collocations were determined. The unit of entry in the list
was the head or stem word. Under it were listed derivatives and compounds
with a frequency of 25 or more. Inflections were not shown on lists
separately but counted under the head word. Homonyms were listed separately.
Word importance was determined basically by its range. The number of
derivatives and compounds of the entries was also noted. It was estimated that
the two lists embraced 95 percent of the material in an adult publication
in Dutch. Words which fell within the 200 most common occurrences were
not included and were designated as "uncountables".

The de la Court list is a general service list. For classroom use,
he added a supplemental list of 67 words as an appendix.

During the 1930's, there were several attempts to improve word
counts by combining them, as Horn and others had done. While the corpora of
the combined lists were larger than those of their component lists, the
resulting lists inherited all the faults of the component lists except those
of small size and restricted sampling. In 1934, Helen S. Eaton began to
compare the first 6,000 words in selected lists of English (Thorndike),
French (Vander Beke), German (Kaeding), and Spanish (Buchanan). Eaton
started with word frequencies, then expanded them into semantic frequencies
of the several meanings of the words. Her idea was to identify and correlate
common concepts as expressed in the most frequently used words in the four
European languages, believing that if the speakers of these languages used
common concepts frequently, these concepts might represent ideas basic to
mankind and be found in other languages as well. Practical problems arose,
however, in restricting the comparison to the semantic variations of only
the first 6,000 words on each list. While a stem word may rank in the first
6,000 on a list in a given language, that does not necessarily mean that any
or all of its meanings will also do so. In addition, stem words lower in
rank than the first 6,000 may have individual meanings that have greater
frequency than some of the meanings of the stem words in the first 6,000. In
a comparison among four languages, the problems are quadrupled. For that
reason, some possible correlations were not made, and some that were made on
the basis of the frequency of the stem words equated very high frequency
meanings of some stem words in one language with very low frequency meanings
of stem words in another language. The result is that some significant
concepts in one language are correlated with much less significant concepts
in another language. Further, the study appears to assume that single word
meanings alone represent concepts. Eaton finally published her completed
work as "An English, French, German and Spanish Word Frequency Dictionary"
in 1940.

Perhaps the last of the reworkings of earlier vocabulary lists
through objective, subjective, and empirical means was Herman Bongers'
so-called "KLM List". This list represents a comparison of several prior
lists containing three thousand or more words. From these lists, Bongers
derived a distilled list of three thousand words. He subdivided the 3,000
words into 1,000-word lists: K, the first thousand; L, the second thousand;
and M, the third thousand. Bongers was greatly influenced by Palmer, and
the KLM list is most comparable to the IRET 3,000-word list of 1932.
However, Thorndike's list and a composite list by Faucett and Maki were also
considered. The KLM lists are arranged alphabetically with derivatives
indented under their headwords and the Thorndike frequency grouping
indicated, together with Bongers' rating when it differs from Thorndike's.
Inflections are not listed except in the case of irregular forms, whether
plurals or verbs. However, inflected forms are considered in the frequency
of the headword, and separate listings are made for homonym forms. When
these lists were tested against ten English texts, they were found to
contain about 97 percent of all the words in those texts, with the K-list
(first 1,000 words) alone accounting for most of the words found,
thus emphasizing again the small part of total vocabulary we normally use.
Bongers published his KLM list in 1947 as an appendix to a comprehensive
study of vocabulary building entitled "The History and Principles of
Vocabulary Control" (1947, Part III, 82 pages).

A comparable, detailed study of word lists and vocabulary, including
frequency counts, entitled "English Word Lists" by Charles C. Fries and
A. Aileen Traver, was first published in 1940 and republished in 1950. Both
the Fries and Traver and the Bongers books give excellent histories and
discussions of vocabularies and frequency counts up to about 1940. However,
they frequently disagree in their analyses of the problems involved in the
counts and in their opinions of the quality of the results obtained by
individual authors. Overall, Fries and Traver are more general, and more
inclined to description than criticism.

In the early 1930's, under the influence of the IRET, a number of
Japanese investigators attempted to identify a minimum basic vocabulary.
Most of the vocabularies so conceived were subjective and/or empirical, and
contained from 1,000 to 2,000 words. Since 1950, however, the Japanese
frequency counts, especially those conducted under the sponsorship of the
National Language Research Institute (NLRI) (or Kokuritsu Kokugo Kenkyujo),
have shown considerable sophistication in vocabulary building by
statistical methods. The following three Japanese language studies are
representative.

Modern Foreign Language Counts 

Japanese 

In the early 1950's, the NLRI started on its "Research in Modern
Vocabulary", which investigated the vocabulary used in women's magazines
and cultural reviews. Part I, published in 1953 (and often cited as a
separate study), gave the report on the research based on sampling the text
of one year's issues of two women's magazines which were considered
representative of that type of publication. A corpus of three million
running words was compiled. Part II, published in 1958, gave a report on
research based on a sampling of 13 cultural reviews which resulted in a
corpus of 230,000 running words.

The analyses made of the findings in each case considered mainly
the statistical and semantic structures of vocabulary and word construction.
Much use was made of statistical sampling as opposed to the more laborious
word counting done by Thorndike and Horn and their associates. Each of
the parts contains a listing of the 4,000 most frequently used words (The
National Language Institute of Japan, 1953 and 1958).

In the late 1950's, the NLRI undertook another study of vocabulary
and Chinese characters found in modern magazines. It covered the fields
of culture, business, popular science, housekeeping, sports, and other
amusements. The NLRI published the results in three volumes in 1962-1964.
Volume I (1962) was entitled "General Description of the Project and
Vocabulary Frequency Tables". Samples included 540,000 words out of a
total corpus of 1,400,000 words. From the 540,000 words the 7,200 most
frequent were published in various forms in a series of eight tables.
Volume II (1963) contained the "Chinese Character Frequency Tables", giving
not only the 1,995 most frequently used Chinese characters but also the
total 3,328 Chinese characters officially used in Japanese. Volume III
(1964) is called "Analysis of Results". However, it also gives much data
not given in the first two volumes in addition to the details of the
procedures followed. For example, it gives the 1,200 most frequently used
words with semantic classifications of the first 700 of them, the
statistical structures of vocabulary, bound forms (idiomatic expressions),
compound words, and homonyms.

Both of the above Japanese studies were based on the printed material
found in periodicals. The only study of comparable scope which has come to
our attention is one by H. Miyaji published in 1971. Miyaji built up a
frequency dictionary from a 250,000 word sampling of Japanese fiction,
periodicals, drama, didactic prose, and scientific writing. Its full title
is "A Frequency Dictionary of Japanese Words".

Russian 

The first major count of Russian was published in 1953 by Harry
Josselson. It was basically a computerized analysis of literary Russian
of the period 1825-1950. Its purpose was to determine word frequencies
and the frequency of occurrence of categories of Standard Literary Russian.

The percentages of the total material collected were 25 percent
from the period 1825-1899, 25 percent from the period 1900-1918, and 50
percent from the period 1919-1951. The styles range from drama, 7 percent,
to fiction, 59 percent. Oral language is included indirectly since samples
were selected to contain 37 percent literary conversation. The purpose of
the count was to assist in the teaching of Russian as a second language.
In common with recent practice, the count contains a list of the 20^> words
most likely to occur in all similar counts of Russian. These words are not
included in the count proper.



In all, one million running words were recorded, of which 526,044
were actually used. Of these, 41,115 were different words. From the 41,115,
a list of 5,230 significant words was published in four lists of
approximately 500 words each and a final list of the remaining 3,000. The
first kSQ words were broken down by range, frequency, time period (in which
written), type of literature, and categorized as conversation or
non-conversation. The remaining words were listed in rank order determined
largely by range. This list can hardly be called current or colloquial, but
may be of assistance in developing courses of instruction for personnel who
wish to read Russian.

The second Russian word count is that of N. P. Vakar. Significantly,
it is called "Spoken Word Count". It is divided into two parts: Volume I,
Vocabulary (1966), and Volume II, Sentence Structure (1969).



In view of Soviet Russians' reluctance to talk into foreign tape
recorders, Dr. Vakar resorted (for Part I) to an indirect method similar
to, but more extensive than, Dr. Josselson's. Vakar took 50-word samples
from each of 200 acts of 93 plays, published between 1957 and 1964 to
ensure currency. These samples provided a 10,000 word corpus, which is
small by most standards, but which Dr. Vakar believed to be sufficient
for colloquial oral Soviet Russian. He found 2,380 different words in the
10,000 word corpus. He also found that 360 of the 2,300 words account for
73 percent of the words used in Russian conversation as represented by the
samples.
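
The arithmetic behind a coverage figure of this kind can be sketched in a
few lines of Python. This is only an illustrative sketch, not Vakar's
procedure: the function name `coverage` and the toy token list are
assumptions introduced here, and only the sampling plan (200 samples of 50
words) comes from the text.

```python
from collections import Counter


def coverage(tokens, top_n):
    """Return the share of running words covered by the top_n most frequent words."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / len(tokens)


# Corpus size implied by the sampling plan described in the text.
corpus_size = 200 * 50  # 200 samples of 50 words each

# Toy stand-in for a transcribed corpus (invented, not Vakar's data).
toy_tokens = "da net da ja ty da net ja on da".split()

# The two most frequent words cover 6 of the 10 running words.
toy_share = coverage(toy_tokens, 2)
```

Applied to a real transcription, `coverage(tokens, 360)` would give the
kind of percentage quoted above, provided the tokens were counted under the
same definition of a "word" that the original study used.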

In Part II, Sentence Structure, Vakar analyzed the material in terms
of "kernel" sentences. Some 1,000 sentences were selected for analysis
from a statistical universe of one million running words found in the same
plays which were sampled for vocabulary in Part I. One of the findings is
that spoken colloquial Russian varies considerably from literary Russian
and that short sentences of 1-5 words make up 75 percent of the total
utterances in oral Russian.

If we can assume, as the author does, that modern Russian drama is
a true representation of colloquial Russian speech, this is an excellent
statistical study of current-day oral Russian. The author validated his
study by comparing its findings with those of several other Russian word
counts, including Josselson's, and noted the differences and similarities.

Spanish 

In Spanish, two word counts made in the early 1950's deserve note.
The first was done at the University of Puerto Rico by the Superior
Teaching Council of Puerto Rico under the directorship of Dr. Ismael
Rodriguez Bou with Dr. Lorge acting as consultant. The word count is called
Spanish Vocabulary Count (Recuento de Vocabulario Español). It is a modern
computer-compiled frequency count published in 1952 to provide for Spanish
the teaching materials already existing in English through the efforts of
Thorndike, West, and others.



This is a comprehensive word count embracing both active and
recognition vocabularies, written, printed, and oral materials, and both
adult and children's vocabularies. The total corpus is 7,066,637 running
words, including Buchanan's corpus and that of an unpublished Spanish count
made in Puerto Rico by two members of the Faculty of the University of
Puerto Rico: Dr. T. Casanova and A. Rodriguez, Jr.

About half of the running words were active (speaking/writing) words,
and the other half recognition (reading/listening) words. The active
vocabulary (about 3,390,000 words) was made up of children's oral, written
(including the Casanova/Rodriguez input) and association inputs. The
recognition vocabulary (about three million words) came from periodicals,
radio programs, religious materials and the Buchanan corpus. In addition,
there were about 700,000 words chosen subjectively by the authors from
school texts and supplemental reading materials.

The count also contains the results of the analyses of children's
conversations and their association vocabulary. The children's material
throughout was from elementary grades 1-6 except for the Casanova-Rodriguez
corpus, which was taken from compositions written by children in grades 2-8.
The oral vocabulary was taken down stenographically or recorded
electronically. Great care was taken to obtain samples representative of
Puerto Rico geographically and of children in all phases of their daily
school life. For the oral samples, great care was taken to do the recording
unobtrusively so that it would represent spontaneous conversation. The
"association" vocabulary was of two types: "controlled" and "free".
Controlled association responses were evoked by stimulus words selected
from a prepared list. The children were told to write all the words which
occurred to them after the stimulus word was spoken. Free association lists
were produced by asking school children to write down all words occurring
to them in five minutes.

Neologisms and regionalisms were included in the corpus, as were
"coined" words not in standard dictionaries, if judged to be common among
educated people.

Frequency was the criterion for rank order of words in the lists.
Inflectional forms were included but semantic frequencies were not.

The seven million running words resolved themselves into 83, '♦SO
different units: 20,542 lexical and 62,999 inflectional forms. Part I
of the count deals primarily with an explanation of the count and
presentation of the first 10,000 lexical and first 20,000 inflectional units
listed in order of frequency and alphabetically. Part II contains all
lexical units and their inflectional forms classified by total frequency and
frequency of appearance in various texts. Excluded from the count, but
listed in an appendix, are the 105 most frequent words of the count.



The second count in Spanish is that of Victor Garcia-Hoz: "Usual,
Common, and Fundamental Vocabulary", published in Madrid in 1953.
Garcia-Hoz also distinguishes between active vocabulary (speaking/writing)
and latent (or recognition) vocabulary (listening/reading). However, he uses
as the source of his corpus only four major categories of materials. He
took a 100,000 word sample from sources in each category for a total of
400,000 running words. The categories and sources of materials were as
follows.



     Aspect of Living             Category of Material

     Private or family life       Private letters
     Unregulated social life      Periodicals (newspapers)
     Organized social life        Official documents of government,
                                  church, and labor unions
     Cultural life                Books and reviews



This is an adult word count. It can be considered to include oral
material only in the sense that private letters are part of active
vocabulary and that words written may also be customarily spoken by the
writer. The 400,000 running words are distributed in descending order of
quantity and ascending order of frequency in such a way that the usual
vocabulary includes the common and fundamental vocabularies and the common
vocabulary includes the fundamental. Significant data on the lists appear
below:

Vocabulary    Description                      Words            Average Frequency

Usual         Language of the common man       12,911           31 (4.2)

Common        Frequency between 40-399 and     1,971 plus       172 (52)
              appears in all four categories   supplemental
                                               list of 212

Fundamental   High frequency (400 and up)      208              1,324 (1,324)
              and evenly distributed among
              all four categories



Looking at frequency of the categories in another way, if we take
the common vocabulary (which includes the fundamental) out of the usual,
the average frequency of the remaining words is 4.2, or about one per
category on the average. If we take the fundamental out of the common
vocabulary, the remaining words have an average frequency of 52, or 13 per
category on the average. By itself, the average frequency of the fundamental
vocabulary is 1,324, or about 331 per category of material on the average.
Thus, the fundamental words are truly the commonest words in frequency and
range. Words with high frequency (over 400) but of uneven distribution were
not included in the fundamental list. Twenty-six words were left out for
this reason; 19 had too high a frequency in writing and seven had too low a
frequency in writing. In determining the common and fundamental categories,
the author made extensive use of mathematical techniques, including
factorial analysis, to establish correlations among the four categories of
material and the three types of vocabulary.
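
The per-category figures in this paragraph are simply the quoted averages
divided by the four categories of material. A minimal sketch of that
arithmetic follows; the dictionary keys and variable names are illustrative
labels chosen here, not terms from the study.

```python
CATEGORIES = 4  # private letters, newspapers, official documents, books and reviews

# Average frequencies as quoted in the text above.
average_frequency = {
    "usual, excluding common": 4.2,
    "common, excluding fundamental": 52,
    "fundamental": 1324,
}

# Spread each average evenly over the four categories of material.
per_category = {
    name: freq / CATEGORIES for name, freq in average_frequency.items()
}

for name, value in per_category.items():
    print(f"{name}: {value:.2f} per category")
```

The even split is itself an assumption of the averaging: it says nothing
about how any individual word is actually distributed across the four
categories, which is why range had to be checked separately.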

As a test, the author compared the words in the vocabulary with the
language used in Spanish drama, to determine whether the words were
colloquial and current. In this, he agrees with Vakar that drama contains
most of the colloquial language of its time. In the normal "periodical"
category, the author omitted sampling magazines on the basis that they are
hybrids between newspapers and books and their words would be included
already.

This vocabulary analysis, like that of the University of Puerto
Rico, does not extend itself to semantic frequencies, nor does it really
involve oral language. However, this count is noteworthy for its ordering
of telescoping vocabularies and for its mathematical computations of the
correlations underlying the selection of words for inclusion in the common
and fundamental vocabulary.

French 

In French, there has been one recent frequency count of special
interest. It was prepared by the National Pedagogical Institute for the
French Ministry of National Education, from 1954-1964. It is called
Fundamental French and consists of First and Second Levels (Stages) and an
Elaboration on the First Level. Fundamental French (1st Level) (French
Ministry of National Education, 1959) replaced "Elementary French", which
appeared in response to Basic and Universal World English without the
restrictions on growth inherent in the Island Vocabulary of Basic English.
The 2nd Level appeared in 1962, to provide vocabulary and grammar for
teachers of students who wanted to extend their knowledge of French beyond
the necessities of daily life. The Elaboration of the First Level
(Gougenheim, et al., 1964) provided the detailed background and procedures
leading up to the First Level.

The purpose of Fundamental French was to provide vocabulary and
grammar for teachers instructing foreign students. The First Level was
fundamentally spoken or oral French, based on an objective and statistical
approach. There are some discrepancies between the explanations given in
the report on the First Level and those in the Elaboration, but the general
procedures and results are given below.

Informants recorded their conversations on tape recorders as
spontaneously as possible under the guidance of research assistants.
Informants from all over France were interviewed. There was an effort to
cover as great a variety of professions and vocations and as wide a range of
subject matter as possible to obtain representative samplings. The 275
informants were mainly adults, about evenly divided among men and women, but
also included 11 children of school age. There was also a good spread of
educational backgrounds among informants, with perhaps the greatest
percentage (37 percent) having completed formal education through the
primary grades only. In all, a corpus of 312,135 running words was compiled,
yielding 7,995 different words. The frequencies varied from 14,083 to 1, and
the range from 163 to 1 (the material of the 275 informants had been
combined into 163 units for examination). For the purpose of the First Level
(Basic), the lexical list was selected from words with a frequency of 29 or
above. This provided a lexical list of 1,963 words. It was a frequency based
list with range considered only to differentiate among words of the same
frequency. When both frequency and range of two words were the same, the
words were listed alphabetically. In the final list, the lexical units
were arranged alphabetically, with no indication of their frequency, since
as far as teachers were concerned they were all equally important. In
common with most counts we have observed, the most frequent words were the
grammatical or structural ones. In the French count, interspersed at lower
frequencies in order of first appearance, were verbs, adjectives, and nouns.
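
The selection and ordering rule described above (a frequency cutoff, with
range used only to break frequency ties and the alphabet to break any
remaining ties) can be sketched as follows. This is a hedged illustration:
the function name `rank_by_frequency`, the tuple layout, and the sample
counts are all invented here, and only the threshold of 29 comes from the
text.

```python
def rank_by_frequency(entries, min_frequency=29):
    """Filter and order (word, frequency, range) tuples.

    Words below the cutoff are dropped. The rest are sorted by
    descending frequency, then descending range, then alphabetically.
    """
    kept = [entry for entry in entries if entry[1] >= min_frequency]
    return sorted(kept, key=lambda e: (-e[1], -e[2], e[0]))


# Invented sample counts, for illustration only.
sample = [
    ("maison", 40, 30),
    ("aller", 40, 35),
    ("ami", 40, 30),
    ("jour", 55, 20),
    ("rare", 5, 2),
]

ranked_words = [word for word, _, _ in rank_by_frequency(sample)]
```

Here "jour" leads on frequency alone, "aller" precedes the other two words
of frequency 40 because of its wider range, and "ami" and "maison", tied on
both counts, fall back to alphabetical order.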

As would be expected from the above, it was determined by comparison
with written counts that certain very useful words, particularly nouns, but
also verbs and adjectives, have only low frequency in written or oral counts
taken from general or random samplings. These concrete words applicable
to specific situations and subjects get crowded out of frequency counts by
the general usage words, of which the grammatical words are the prototype.
The authors called these concrete words (which are needed even in a basic
vocabulary but appear in general word counts with only very low frequencies,
if at all) "available words". Everyone has to know them, but the occasion
for their use occurs only infrequently. In this way, "availability" becomes
a second principle of Fundamental French along with frequency. To determine
what "available" words to use, the researchers resorted to a controlled
association type of collection covering 16 interest areas, such as
"furniture", using 904 elementary school students aged 12-13 of the
Departments of France. Each student supplied 20 words per subject area.
Those of highest frequency were added to the First Level vocabulary.
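
A tally of this controlled-association kind might look like the sketch
below. It rests on stated assumptions: the function name
`available_words`, the rule of counting each student at most once per
word, the alphabetical tie-break, and the sample responses are all
introduced here for illustration and are not documented features of the
French study.

```python
from collections import Counter


def available_words(responses, top_n):
    """Rank words for one interest area by how many students named them.

    responses is a list of word lists, one list per student.
    """
    counts = Counter()
    for words in responses:
        counts.update(set(words))  # assumption: one vote per student per word
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


# Invented responses to the stimulus area "furniture".
furniture = [
    ["table", "chaise", "lit"],
    ["table", "lit", "armoire"],
    ["table", "chaise"],
]

top_two = available_words(furniture, 2)
```

In this toy case "table" is named by all three students, and "chaise" edges
out "lit" only through the alphabetical tie-break, which shows how
sensitive a short list can be to the tie-breaking rule chosen.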

Although a semantic frequency count was not made, the words on the
list were checked for meaning, and where concepts essential for educational
or communicative purposes were missing, words to convey them were added.
This procedure added about 400 words. The list was then culled to eliminate
certain words which, although warranted by frequency, were close synonyms
of words of higher frequency, were vulgar words, were difficult to learn, or
for some other reason failed to conform to the objectives of Fundamental
French.

The final list of the First Level contained 1,445 items: 1,176
lexical words and 269 grammatical words. The grammatical words chosen were
the minimum deemed required to permit flexibility in the use of the
language. The lexical list had a general alphabetical list of all words,
followed by special lists of related words such as numbers, days of the
week, months of the year, and seasons. The list was kept deliberately
general with the exception of the items indicated above. It was designed to
be a minimum vocabulary to which specific additions could and would be made
by teachers according to the environmental needs of their pupils, especially
regionalisms to adapt the standard language to the needs of particular
geographical areas of France.



For teachers of those who wished to go beyond the ability to express
the daily needs of life and to acquire a more complete knowledge of French
and French culture, Fundamental French (Second Level) was developed. Unlike
the First Level, which is largely based on the oral frequency count, the
Second Level is based on the written language and includes additional
grammatical terms, in order to provide the student considerable flexibility
of expression and an ability to read newspapers and books.

The First Level took in words from the original word count down as
far as frequency 29. The Second Level lowered the threshold to 20 or
above, and included many of those above 29 which had been rejected as not
required for the First Level, particularly those which were eliminated by
reason of duplication of basic concepts. The Second Level also adopted
many of the terms on the association lists of the 16 interest areas which
had not been deemed to have sufficient frequency to warrant inclusion on
the First Level. In addition, the authors took words from the Vander Beke
list with a frequency of dO or more, even though that list was both literary
and dated. Next, they undertook new investigations and short counts to
update Vander Beke's count. One field was newspapers and magazines. The
researchers counted words appearing under 14 subject areas in the newspapers
and magazines and added an average of 35 words from each of the subject
areas if not already in the First Level list as amended. These additions
amounted to 425 words with a frequency of 13 or higher. Further, using the
association method, 160 students at teacher preparation institutions
throughout France furnished lists of psychological terms. Those used by 15
or more of the informants were added to the list. Finally, the list was
submitted to a panel of experts who added such words as deemed by them to
be required to meet the purposes of secondary level French instruction.
Like the First Level, the Second Level of Fundamental French contains an
alphabetical list of lexical units, and a section of grammatical words.

Note that in Fundamental French, the vocabulary lists are a
combination of objective frequency counts, empirical inclusion of concrete
words, exclusion of duplicating words and those of low frequency, inclusion
of other words based on empirical association by students, and addition
of still others based on the subjective judgment of panels of experts.

Fundamental French is of interest not only because the First Level is
oral but because it provided a point of departure for Dr. J. Alan Pfeffer
of the University of Pittsburgh in a study of oral German which will be
discussed next.

German 

In German, there have been three recent studies: one general and
oral by Pfeffer, one on newspaper vocabulary by Rodney Swenson of Hamline
University in St. Paul, Minnesota, and one by Scherer of the University of
Colorado on the "Short Story in the Second Quarter of the 20th Century".
Dr. Pfeffer's is the most representative of the three. Since it is
basically oral, it also is best suited to our purposes. For that reason, it
will be described briefly. In many ways, it is one of the best of the
modern word counts, having profited from the faults of prior studies. So
far, of eleven expected publications to result from his study, Pfeffer has
published three: Basic (Spoken) German Word List (1964), Index of English
Equivalents for the Basic (Spoken) German Word List (1965), and Basic
(Spoken) German Idiom List (1968). Before undertaking his study, Pfeffer
reviewed the field of word counts and noted the best features of the recent
ones, especially the Spanish Word Count produced by the University of
Puerto Rico (Rodriguez Bou, 1952), and Fundamental French, produced under
the auspices of the French Ministry of National Education (1959, 1962). In
general, Pfeffer appears to have followed, but improved upon, the
procedures used by the authors of Fundamental French (First Level) and
provided oral, topical (utility or available) and empirical inputs to his
own corpus of oral German.

The first step was the collection of the oral vocabulary. This was
done by means of taped interviews on 25 human interest subjects. The
interviews took place in 56 cities and towns in Austria, German-speaking
cantons of Switzerland, and West Germany. Basic data such as age, sex,
educational background, vocation, and type and size of residence were
recorded for each informant. In all, 401 twelve-minute recordings were
transcribed and the words placed in context on ADP punch cards. In this
process, proper names, place names, and faults of speech were deleted as
being peculiar to the individual or place. In this way, an oral corpus of
595,000 lexical units was derived from which nearly 25,000 separate lexical
units were isolated. Inflections were subsumed under their headwords, but
their frequencies separately recorded. The frequency varied from 50,256 to
1 and range (of speakers) from 450 to 1. (Some interviewers' conversations
were also included, so the 401 interviews developed into a range of 450
speakers.) From the 25,000 separate lexical units, nearly 1,000
representing the most common words, with frequency at least equal to 40 and
range equalling at least 25, were selected for further analysis. The
analysis was concerned mainly with applicability, universality, and
indispensability. This screening process reduced the list from 1,000 to 737
spoken words. (The oral part of the corpus.)
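
The first, purely numerical stage of this screening (frequency of at least
40 and a speaker range of at least 25) is easy to state in code. The
sketch below is illustrative only: the function name `screen_units`, the
tuple layout, and every count in the sample are invented, and the later
judgments of applicability, universality, and indispensability are not
something a threshold test can reproduce.

```python
def screen_units(units, min_frequency=40, min_range=25):
    """Keep (word, frequency, speaker_range) tuples that meet both thresholds."""
    return [
        unit for unit in units
        if unit[1] >= min_frequency and unit[2] >= min_range
    ]


# Invented counts, for illustration only.
sample_units = [
    ("und", 900, 300),    # frequent and widely used: kept
    ("heute", 60, 30),    # just above both thresholds: kept
    ("selten", 45, 10),   # frequent enough, but used by too few speakers
    ("kaum", 12, 40),     # widely used, but too infrequent
]

candidates = [word for word, _, _ in screen_units(sample_units)]
```

Requiring both conditions at once is what keeps a word used heavily by a
handful of speakers from entering the list on frequency alone.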

The utility (topical) words were collected by controlled association
in 82 intermediate and academic high schools in German, Swiss (German
speaking) and Austrian cities and towns. The informants were about 15-16
years old, of both urban and rural backgrounds, and about equally divided
as to sex. The students were given a stimulus topic selected from a list
of 21 such subjects, such as "buying and selling". They were then given
ten minutes to write down 20 nouns (or 12 verbs and eight adjectives)
related to the stimulus topic. (Whether nouns or verbs and adjectives were
to be collected and on what topics was specified in the request to each
school.) The effort yielded a topical corpus of 833,000 terms from which
19,700 nouns, 6,800 verbs and 7,400 adjectives were derived. Applying the
criterion of applicability to the topical list narrowed it to 347 nouns,
verbs, and adjectives.

In the empirical stage, the 737 oral words were combined with the
347 topical words and all of them were examined together for gaps in
sequence, derivation, opposites, topical limitations, parts of common
compounds, and common concepts. The result was the addition of 185 words to
round out the basic list. About three-fourths of the words had already
been considered in the uncut oral or topical corpora, but had been
eliminated, generally because their range, frequency, or both had been too
low. The resultant total word list consisted of 1,269 words. They were
presented in alphabetical order (by family groups), then by parts of speech,
and finally in order of frequency and range.

The Index of English Equivalents (1965) gives the most common 75
percent of semantic meanings, and indicates the percentage of the headword
represented by the frequency of each meaning listed. From this list,
teachers can easily determine which of the several current meanings in
oral usage are of most importance for students to learn. For background on
handling semantics, Dr. Pfeffer leaned heavily on Dr. Lorge's treatment in
his published semantic analysis of the 570 most common English words.

In the idiom list, Pfeffer defined an idiom as a "semantic
restriction of syntactically collocated parts". In the gathering of the
corpus of 595,000 running words for the basic oral vocabulary, Pfeffer
identified nearly 25,000 single words and 7,500 phrases. From the 7,500
phrases, he extracted 1,026 idioms, and an additional 99 idioms from the
utility and empirical studies. The total of his idioms is thus 1,125. In his
study of idioms, Pfeffer compared his list with that of Hauch published in
1929, and indicated which of the items in his list were also in Hauch's.

Dr. Pfeffer estimates that his Basic Word List, Semantic Equivalents
and Idioms account for about 35 percent of the free forms, and of the
restricted forms and patterns, in colloquial German speech of the present
day.

Swahili

In Swahili, there appears to have been no major or comprehensive
frequency count of the written or oral language. There have been
subjective and empirical studies made which resulted in grammars and
dictionaries, however. Bilingual dictionaries, for example, have appeared in
several European languages: Swahili-English (French, German, Polish, and
Russian). Missionaries started compiling grammars and vocabularies, which
grew into dictionaries, as early as the 1850's. Dr. Krapf published his
dictionary in 1882, followed by Madan in 1903. Perhaps the best known
dictionary in English was compiled by Frederick Johnson in 1939. In the
same year, a well-known French-Swahili dictionary compiled by Charles
Sacleux appeared. One of the latest in English was published by D. V.
Perrott in 1970. With regard to grammars, one of the first in Swahili
appeared in 1850, prepared by the same Dr. Krapf who later published one of
the earlier dictionaries. His grammar was shortly followed by Bishop
Steere's in 1870. Most of the grammars contain vocabularies for each lesson
and a glossary of all words used as an appendix. While many are more
interested in translation and writing than conversation, there is an
increasing number which devote considerable space to conversation, as
exemplified by the publications of the Foreign Service Institute of the
Department of State, which include "Swahili: An Active Introduction
(Conversation)" (Stevick, et al., 1967). Other good grammars are Edgar
Polomé's "Swahili Handbook" (1967), and D. V. Perrott's "Teach Yourself
Swahili" (1951, 1967). The Belgians have also been interested in Swahili
because of their interest in the Congo, particularly in Katanga, where a
dialect form of Swahili called Kingwana is spoken. In the 1940's, Van den
Eynde developed his "Grammaire Swahili" (1944), but considered the Katangan
dialect so bad he concentrated on the so-called Standard Swahili of the
East Coast. On the other hand, E. Natalis, in a three volume work called
"La Langue Swahili" which appeared in 1965, addressed principally the
dialect of Swahili spoken in Katanga.

In recent years, there have also been some specialized studies,
principally by students and scholars, on the various aspects of Swahili
grammar. In the United States, there seems to have been a concentration
on the verb. Carol Eastman made a study of verbal extensions (1967), Carol
Scotten delved into the extended verb system (1967), and Rae Moore made a
study of verbal derivations (1966). On the other hand, Judith Olinick
became interested in exploring transformational grammar as it relates to
certain noun phrases (1967). In spite of these recent studies, many of them
done with the aid of tape recorders and computer manipulation of results,
there is still a need for an extensive frequency count in this language,
similar to the latest ones done in the European languages and Japanese.

Comparative Studies 

In the field of comparative linguistics, Kucera and Monroe (1968)
published "A Comparative Quantitative Phonology of Russian, Czech, and
German". This study attempted by comparative analyses to determine the
value of a statistical approach to historical phonology by studying the
differences and similarities in historically related or geographically
contiguous languages. The study was based on the printed word, principally
prose fiction (60 percent) with half the rest of the words taken from
periodicals. As a result of their study, the authors concluded that a
close genetic relationship of two languages (e.g., Russian and Czech) is
likely to show up at the phonological level in similar phonotactics but
not necessarily in similar phonemic systems. On the other hand, languages
in close geographic contact (e.g., Czech and German) may well show the
greatest similarity at the phonological level in phonemic inventory, with
much less similarity in their phonotactics.

Modern English Studies



Structural Analyses 

In the field of structural analysis, the first study was on transitional
frequencies of English phonemes by Hultzen, Allen, and Miron.
The corpus for their study was developed by Professor Agard of Cornell for
some of Dr. Carroll's studies. It consisted of material from 11 plays and
selections from the Journal of Modern English. From these sources, some
20,000 phonemes were collected in phoneme sequences. The phoneme corpus
was manipulated by computer to produce displays with supporting tables of
the number of occurrences (1) of each phoneme, (2) of each two phoneme
sequence, (3) of each three phoneme sequence, and (4) of each four phoneme
sequence.
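
The four tables described above can be sketched in a few lines of
modern code. This is only an illustration of the counting idea, not the
procedure Hultzen, Allen, and Miron actually programmed; the function name
`sequence_counts` and the toy transcription are assumptions made for the
example.

```python
from collections import Counter


def sequence_counts(phonemes, max_n=4):
    """Count every run of 1..max_n consecutive phonemes in a transcription."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(phonemes) - n + 1):
            counts[n][tuple(phonemes[i:i + n])] += 1
    return counts


# Hypothetical toy transcription; the symbols are illustrative only.
sample = ["dh", "ax", "k", "ae", "t", "s", "ae", "t"]
tables = sequence_counts(sample)

print(tables[1].most_common(2))   # single-phoneme table
print(tables[2][("ae", "t")])     # occurrences of one two-phoneme sequence
```

Each entry of `tables` corresponds to one of the four displays: single
phonemes, then two, three, and four phoneme sequences.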



The second study of structural English, by A. Hood Roberts, extended
the Hultzen, Allen and Miron counts by making a quantitative analysis of
the segmental phonemes contained in Horn's "A Basic Writing Vocabulary of
10,000 Words" (1926) and Lorge and Thorndike's "A Semantic Count of English
Words" (1938) supplemented by Lorge's "Semantic Count of the 570 Most Common
English Words" (1949). The Horn vocabulary items were spoken lists in
sentence patterns and recorded on tape in north central dialect. The 10,000
words were transcribed phonemically and their etymologies tabulated. The
results were then manipulated by computer and analyzed to produce tables
listing the frequency of occurrence of the phonemes, average word length in
phonemes, transitional probabilities of phonemes, and the etymological
composition of English according to proximate sources (e.g., French as
compared to the more remote Latin).
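
Two of the tabulated quantities, transitional probabilities and average
word length in phonemes, can be illustrated as follows. This is a minimal
sketch under stated assumptions: the four-word lexicon and the function
names are invented for the example, and the probability is estimated simply
as the share of times one phoneme is followed by another within a word,
which may differ from the method Roberts used.

```python
from collections import Counter, defaultdict


def transitional_probabilities(words):
    """Estimate P(next phoneme | current phoneme) from within-word pairs."""
    pair_counts = defaultdict(Counter)
    for word in words:
        for current, following in zip(word, word[1:]):
            pair_counts[current][following] += 1
    return {
        current: {nxt: c / sum(followers.values())
                  for nxt, c in followers.items()}
        for current, followers in pair_counts.items()
    }


def average_length(words):
    """Average word length measured in phonemes."""
    return sum(len(w) for w in words) / len(words)


# Hypothetical phonemic transcriptions of "cat", "cap", "tack", "at".
lexicon = [["k", "ae", "t"], ["k", "ae", "p"], ["t", "ae", "k"], ["ae", "t"]]
probs = transitional_probabilities(lexicon)

print(probs["ae"])               # what follows "ae", and how often
print(average_length(lexicon))   # mean phonemes per word
```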

Adult Oral Word Counts

There have been six important frequency counts of oral English since
1950. Two are of children's speech, two of college students, and two largely
of the general public.

The first, in 1955, was that done by Black and Ausherman of the speech
of students in classroom situations. Actually, the college students were
servicemen of college age and background who were taking college courses
in preparation for becoming military meteorologists. The informants were
27^ male students who participated in 607 five-minute classroom speeches 
of which three and one half to four minutes of each were recorded. The 
students were unaware of the recording. The students were actually giving 
nearly extemporaneous speeches on material connected with meteorology or 
its background subjects, and related to its military application. The 
students had prepared outlines of the topics to be covered in their talks,
but otherwise the speeches could be considered spontaneous. 

The informants as a group were mid-westerners, highly intelligent,
with good prior scholastic credentials and high aptitudes in mathematics.
As a group, they might be considered atypical on the side of high aptitude
in mathematics and high degree of prior academic achievement.

A corpus of 285,152 running and 6,026 different words was compiled
with frequencies ranging from 15,000 to 1. Comparison with the Thorndike
Teacher's Word Book (1944) (printed English) showed differences and
inconsistencies. There were many words in the Thorndike list which were not
in the oral list, and vice versa. The discrepancy amounted to about ten percent
of the oral list. Thorndike's first 1,000 words accounted for only UO 

percent) of the first 1,000 words of the oral list. Comparisons were
closer in the case of Godfrey Dewey's Relativ Frequency of English Speech
Sounds (1923). Dewey's first nine words making up 15 percent of words used
amounted to 22 percent of the oral list. All the first 50 most common oral
words were found in the first 83 of the Dewey list, and all but three of
Dewey's first 50 were found in the first 100 oral words.

These comparisons with the Thorndike and Dewey lists are not
entirely appropriate since the two printed counts are considerably dated.
Other differences were introduced by the fact that the informants tended
to neologisms, slang, occupational jargon, and colloquial compounds largely
related to their prospective work in the military and the cultural subarea
in which they were raised.

The second so-called adult oral word count was conducted by Davis
Howes in the Boston, Massachusetts area during the period 1960-1965 (1966).
Informants were 20 sophomores at MIT and Northeastern University and 21
patients at the Veterans' Administration Hospital in the Boston area. The
21 patients were free of cerebral defects and any debilitating disease.
40 of the informants were taped in the course of free speech delivered in
response to general questions designed to produce natural and spontaneous
speech. This procedure was kept up until 5,000 words had been obtained
from each informant. The 41st informant provided ten of the total 50
interviews in order to give data on the stability of word frequency. The total
corpus was 250,000 words which were cataloged by source and origin, i.e.,
school or VA. ^,699 individual words were Identified, but i|,on7 (i»7 percent) 
of them occurred only once.

The study confirmed findings of others that oral language uses
fewer words (has a lower type/token ratio) than printed/written English
and that only very large counts of running words would reveal very rare
words. In contrast to most counts, proper and place names were recorded
and counted, as well as certain utterances which were non-words and/or
markers (e.g., nnn, uh, etc.).
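
The type/token ratio and the share of words occurring only once are
simple to compute. The following is a minimal sketch, assuming whitespace
tokenization and two invented sample utterances; `vocabulary_profile` is a
hypothetical helper, not anything taken from Howes.

```python
from collections import Counter


def vocabulary_profile(tokens):
    """Summarize running words (tokens) against different words (types)."""
    freq = Counter(tokens)
    types = len(freq)
    hapaxes = sum(1 for count in freq.values() if count == 1)
    return {
        "tokens": len(tokens),
        "types": types,
        "type_token_ratio": types / len(tokens),
        "hapax_share": hapaxes / types,   # share of types occurring once
    }


# Invented samples: repetitive speech versus denser written prose.
spoken = "well i i think uh i think it is it is fine".split()
written = ("the committee considered several proposals "
           "before reaching a decision").split()

print(vocabulary_profile(spoken))
print(vocabulary_profile(written))
```

On samples this small the contrast is exaggerated; the ratio also falls as
a corpus grows, so it is only comparable between samples of similar size.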

Howes undertook the count to update prior counts and correct
deficiencies in them; i.e., Thorndike lacked an oral input and the Bell
Telephone count of 1930 (French, Carter, and Koenig) collected speech
sounds useful for technical purposes but in a manner not likely to provide
assistance in a count of normal spoken vocabulary.

The third "adult" oral word count was published by Lyle V. Jones
and Joseph M. Wepman in 1966. This count samples the utterances of 54
adults. They ranged in age from 18-80, but were mostly in the older half
of the bracket. Educational backgrounds of the individuals in the group
varied from 2nd grade to PhD. 20 picture cards from Murray's Thematic
Apperception Test of 1943 were used to stimulate spontaneous conversation.
The mean number of words per subject thus evoked was 2,527, with a range of
1,032 to 5,276. The total corpus was 136,450 running words. The results
were tabulated and manipulated by computer to provide three lists:

A. The 1,102 words most often used by the 54 speakers, down to a
frequency of ^/l 00, 000. 

B. Words with a range of at least 2, arranged by grammatical class
and alphabetically within class.

C. List B in straight alphabetical order, including inflectional
forms.

The results showed little difference in word diversity between male and
female or between those over and under 60, but distinct differences among
socio-economic-educational groups.

This limited study indicates that 33 words account for 50 percent
of the oral words used. This is half as many as estimated for the written
and printed languages and generally confirms earlier studies in this respect.
Jones and Wepman attribute this lesser diversity of oral speech, and the
tendency to repeat frequent words more often in talking than in writing, to
the fact that meaning is conveyed in face-to-face contact by bodily movements,
facial expressions, eye contact, and intonation, whereas additional words
are required in writing to ensure that the intended ideas are, in fact,
conveyed.
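
A figure such as "33 words account for 50 percent" is obtained by ranking
words by frequency and accumulating counts until the target share of the
running words is reached. The sketch below shows that calculation on an
invented toy sample; the function name and the data are assumptions, not
Jones and Wepman's material.

```python
from collections import Counter


def words_needed_for_coverage(tokens, share=0.5):
    """Number of top-ranked words whose counts reach `share` of all tokens."""
    freq = Counter(tokens)
    total = len(tokens)
    covered = 0
    for rank, (_, count) in enumerate(freq.most_common(), start=1):
        covered += count
        if covered >= share * total:
            return rank
    return len(freq)


# Invented sample of 12 running words.
toy = ["the"] * 5 + ["a"] * 3 + ["of"] * 2 + ["cat", "dog"]

print(words_needed_for_coverage(toy, 0.5))
print(words_needed_for_coverage(toy, 0.75))
```

The fewer words needed to reach a given share, the more heavily the sample
leans on its most frequent items.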

Zipf's Law of the inverse relationship of word length to frequency
was borne out by this study up to a word length of four letters, i.e.,
down the frequency list to about the 100th word in order of descending
frequency rank.



The fourth "adult" oral count, by Kenneth Berger in 1967, is entitled
"The Most Common Words Used in Conversation". It is mentioned because of
the "conversational" aspect and its clandestine (perhaps unethical) method
of collecting its corpus. Others have despaired of obtaining really
spontaneous speech, e.g., the field workers for Fundamental French (1959)
and Pfeffer in his collection for Basic (Spoken) German (1964). As a result,
most spoken speech samples, until Berger's count, have to some extent lacked
complete spontaneity. However, Berger was able to obtain unguarded
conversations from bars and restaurants. His unwitting informants were largely
white, male businessmen, white collar workers, and skilled laborers. There
seemed to be few professional, farm, or unskilled workers or students involved.
The speech collected is that of Kent, Ohio and its vicinity. Berger
developed his own criteria for acceptance of utterances, which make his study
somewhat different in methodology as well as subject matter. He accepted
as sentences utterances of as few as two words which had a predicate or were
a complete, although laconic, answer to a query. Slang, curse words,
mispronunciations, and ungrammatical expressions were accepted. No more than
four sentences from any conversational group were accepted to ensure variety
within the small corpus (25,000 running words). Words normally eliminated,
such as family names, place names, and other specific nouns, are listed in
appendices. Berger tabulated variants under the stem word unless the forms
or variants added a syllable. Forms or variants with a different number of
syllables were given separate listings if the variant and its stem word
each had a frequency of more than one, and if the variant and its stem word
were both used with about the same frequency. The number of sentences
transcribed was 2, MO, with a mean sentence length of 6.7 words, represen- 
ting 2,507 different words. Almost half of the 2,507 words appeared only 
once. Significant findings included: (1) frequent use of "I" and "you",
(2) use of indefinite and relative pronouns in lieu of nouns, (3) simplicity
of language, and (4) confirmation of Zipf's five generalizations regarding
inverse ratios of word length and frequency, and of the number of words used
and frequency. Speculative findings are that conversational speech
vocabulary is extremely sensitive to place, time, and current events and is
subject to rapid evolutionary change.

Children's Oral Word Counts

The Beier, Starkweather, and Miller (1967) study was undertaken to
determine the psychological parameters governing children's communications
and also to determine whether Zipf's Laws as derived from printed/written
counts were applicable to spoken counts of children's language.

The experiment took place in grades 6 (age 12) and 10 (age 16) in
the Salt Lake City Public Schools. The 30 informants were all boys, half
in each grade. They were selected to have a normal IQ range (90-110).
In addition, data on their scholastic performance was obtained and recorded.
The stimulus material is not stated in the report, but each boy recorded
about 5,000 words from which about 2,700 were selected and compiled into
two 40,000 word corpora (one for each grade) for a grand total of 80,000
words. Five one-minute samples of each boy's contribution were timed
to obtain a rate of speaking for each informant. Comparisons were made
between age groups and with the Eldridge frequency count of newspaper
English in the Buffalo area in 1911.

The results tended to confirm prior findings of greater variety
of expression in printed language than in speech. However, the validity
of the results may be undermined by the fact that adult newspaper English
in the Buffalo area was compared with the oral language of school
children in the Salt Lake City area in 1966. It should be expected that
adult oral conversation would show greater diversity and variation than
that of children of the ages used in this study. It would, therefore,
have been better to have compared this count of children's oral English
with a printed count of about the same date. In any event, the
findings confirmed that for those two age groups and the small corpus
obtained, Zipf's Law applied to oral as well as printed language.
Specifically, the number of words of a given frequency increases as the frequency
of use decreases, and the shorter the word the more frequent its occurrence.
The study also determined that eight of the first ten words on the 6th
and 10th grade lists were the same, although not in the same order. Other
findings indicated that the 16 year olds, as compared to the 12 year olds,
spoke faster and used significantly more positive and negative words,
slightly more singular self-reference words, slightly fewer plural
self-reference words, more "other" references, and slightly more "question"
words. At equivalent intelligence levels, age made little difference
in the ratio of different to total words.
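
The two regularities named above can be checked on any tokenized sample:
how many different words occur at each frequency, and how long the words
at each frequency are on average. The following sketch uses an invented
sample and hypothetical function names; it only illustrates the
tabulation, not the statistical tests used in the study.

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Map each frequency f to the number of different words used f times."""
    return Counter(Counter(tokens).values())


def mean_length_by_frequency(tokens):
    """Average length in letters of the words at each frequency."""
    lengths = {}
    for word, f in Counter(tokens).items():
        lengths.setdefault(f, []).append(len(word))
    return {f: sum(v) / len(v) for f, v in lengths.items()}


# Invented sample of 15 running words.
sample = "a a a a to to to it it the the cat dog house window".split()
spectrum = frequency_spectrum(sample)
lengths = mean_length_by_frequency(sample)

for f in sorted(spectrum):
    print(f, spectrum[f], lengths[f])
```

In this toy case the number of different words rises, and mean word length
grows, as frequency falls, which is the pattern the study reports.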

The second spoken word count of children's language was that of
Wepman and Hass, published in 1969. The children in this study were of
ages 5-7. The count was undertaken to update and extend prior counts
of the oral language of children in order to obtain information on grammatical
development, semantic extension, and vocabulary increase as correlated
with chronological age. The informants were 90 children (45 male and
45 female) equally divided among ages five, six, and seven. They were
all from middle income homes and large urban areas well distributed around
the United States. All were unilingual English speakers and had no apparent
mental or physical handicaps.

The Murray Thematic Apperception Test of 1943 was used, with each
child asked to tell stories about 20 picture cards. The material was then
manipulated by computer.

The results were arranged in three lists for each age group, as
follows:

A. Word frequency order (for all words with a range of at least
two) of stem words.

B. Words by grammatical class, alphabetically within class. If a
word was used as two parts of speech it was listed under each.

C. Alphabetical, including all inflectional forms and grammatical
uses.

The report states no conclusion, but introduces two new concepts: a
"mean" frequency for each age group on the basis of 10,000 words and all
30 informants, and a "variation" which represents the difference in the
frequency of use of the word by high and low users as compared to the total
number of users. A high variance index indicates that some children use
the word frequently and some very little, and is, therefore, another index of
what other researchers have called range. It is useful in comparing words
of equal mean frequency of use since it permits estimating whether the mean
frequency represents general use or use by only a few.
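
The value of pairing a mean frequency with a measure of spread can be
shown with a small sketch. The report does not give the formula for the
"variation" index, so the code below substitutes the coefficient of
variation of each speaker's rate per 10,000 words; that substitution, the
function names, and the three invented speakers are all assumptions made
for illustration.

```python
from statistics import mean, pstdev


def rate_per_10000(count, total_words):
    """Occurrences of a word scaled to a 10,000-word sample."""
    return 10000 * count / total_words


def mean_and_variation(word, speakers):
    """speakers: one list of tokens per informant.

    Returns the mean rate per 10,000 words and, as an assumed stand-in
    for the report's index, the coefficient of variation across speakers.
    """
    rates = [rate_per_10000(tokens.count(word), len(tokens))
             for tokens in speakers]
    m = mean(rates)
    variation = pstdev(rates) / m if m else 0.0
    return m, variation


# Three invented informants of ten running words each.
speakers = [
    ["dog"] * 2 + ["train"] * 6 + ["and"] * 2,
    ["dog"] * 2 + ["and"] * 8,
    ["dog"] * 2 + ["and"] * 8,
]

print(mean_and_variation("dog", speakers))     # used evenly by all
print(mean_and_variation("train", speakers))   # used by one child only
```

Both words have the same mean rate, but only the spread reveals that one
of them comes from a single heavy user.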

Printed Word Counts

In the field of printed counts, two good reports have appeared
recently: one by Kucera and Francis on adult language and one by Carroll,
Davies, and Richman done for American Heritage on children's language.

In 1967, Henry Kucera and W. Francis published "A Computational
Analysis of Present Day American English". It essentially replaces the
Lorge and Thorndike "Teacher's Word Book of 30,000 Words" published in 1944.

A corpus of nearly one million words was compiled from recent and
current publications, dating from 1961. To ensure adequate coverage,
15 categories of material were included: newspapers (editorials and
reviews), religion, skills and hobbies, popular lore, literature and biography,
government documents, learned and scientific, fiction (five types: general,
mystery/detective, science, adventure/western, and romance/love story),
and humor.

500 samples of 2,000 words each of continuous discourse were randomly
selected for transcription and computer analysis. The results of the
analysis were displayed in two ways: word lists and statistical tables and graphs.

The word lists are principally of three types: (1) descending order
of frequency, (2) alphabetical, and (3) the first 100 most frequent words
by total frequency and by frequency in each of the 15 categories of sampled
materials. The statistical tables tabulate both word frequency distribution
and sentence length distribution.

The grammatical (sentence length) analysis with frequency
distribution is an added dimension to English word counts. For the samples as a
whole, the sentences had a mean length of 19.24 words, with sentence
length ranging from 25.49 words for governmental documents to 12.76 for
fiction and mystery stories.

Although this is an excellent example of a current objective word
count, it would have been better had it included a semantic frequency count
as well as a lexical count.

In 1971, John Carroll, Peter Davies, and Barry Richman completed a
word count of the printed language of children. It was published by the
American Heritage Publishing Company and the Houghton-Mifflin Company as
"Word Frequency Book". Although it is useful for many educational purposes,
it is primarily intended, like West's Definition Vocabulary of 1935, as the
basis for a dictionary; this time a revision of the American Heritage School
Dictionary. It is based on the printed language to which public and
parochial school children in the United States are exposed in grades three
through nine. The samples were taken from publications covering 22 subject
areas: 17 curriculum areas, three library categories, magazines, and
religion. The curriculum categories alone sample 1,045 items (texts and
other published materials) recommended by nearly half of the schools which
responded to a questionnaire concerning published materials used by students
in 1969.

The words to be analyzed were taken in 200 word samples from the
selected printed materials until a total of 5,083,721 tokens had been
amassed. The types in this corpus were determined to be 86,741. The words,
after computer processing, were displayed in two types of output, only the
second of which appears in the Word Frequency Book. These were: citations
(occurrences of types extracted in sufficient context to provide for analysis
for definitional purposes) and descriptive statistics (frequency of occurrence
and distribution).

3 Compare with Dewey's (1923) finding of 19.6.

The Herdan/Carroll lognormal model was used for computations. Results
are tabulated alphabetically, indicating total frequency, frequency of
occurrence by grade level and subject, and an index of distribution (range).
Unlike many other objective frequency counts, this book includes proper
names, place names, and numbers. Results are also tabulated in frequency
rank lists and frequency grouped distribution lists, by total, grade level,
and category of material.

This is an excellent current frequency count of the printed
vocabulary to which primary grade and junior high school children are exposed.
Its source material is wide and representative and its corpus ample (five
million words). It would have been more helpful to the teacher if a
semantic frequency count had been included. However, the material on which
such a count could have been made is available and is being used in the
revision of the American Heritage School Dictionary. Hopefully, a semantic
count will follow.

Summary



The science of objective or direct word counting has come a long
way since the Kaeding Count of 1898. Oral counts such as those of Pfeffer
(1964, 1965, and 1968) on Basic Spoken German and Jones and Wepman on
U.S. English (1966) are now on a par with comparable ones of printed/written
languages, as exemplified by Kucera and Francis (1967) for adults and
Carroll, Davies, and Richman for children (1971), all of which make extensive
use of computer compilation and analysis.

Bongers stated that a corpus of at least one million words is required
for a valid objective frequency count (Bongers. 19^7. page 2^0). Even with 
the aid of computers, such a corpus is only laboriously obtained,
manipulated, and analyzed. No matter how objective it may be, such a count is
always subjective to the extent that someone must select the materials from
which samples will be taken, and decide on the size of samples and their
method of selection, even if the materials are chosen as a result of
consensus of replies to a questionnaire.

A possible alternative to the so-called objective word frequency
counts has been suggested by Bernard Shapiro in his doctoral thesis entitled
"The Subjective Scaling of Relative Word Frequency" (1967). Dr. Shapiro
determined experimentally that relative word frequencies are a prothetic
psychological-additive variable (as are other linguistic items) and that
they are best subjectively measured by the "magnitude estimation" technique
which follows Stevens' (power) Law. Further studies, if they verify
Shapiro's work, may permit the determination of relative word frequencies
and the development of relative frequency lists by subjective means and
their conversion to objective word lists by means of mathematical formulae,
tables, and graphs, thus saving much time, effort, and expense.

The statistical sampling techniques used in the latest Japanese
counts also deserve further study in an effort to ensure representative
sampling in an economical manner.

HISTORY AND DISCUSSION OF WORD FREQUENCY COUNTS

SECTION II - DISCUSSION

Purposes of Word Counts

We have seen that word frequency counts have been made in many
languages and for many purposes related to teaching and learning, such as
stenography, spelling, vocabulary building for graded readers, and
determining the essentials of oral vocabulary. They have been made for
purposes of psychological research. They have been made both on the words
used by children and on those used by adults.

Generally, however, the intent has been to simplify instruction and
to economize on time and effort by concentrating on relevant and appropriate
materials at successive levels of education, whether for the written or
oral natural language or some shorthand representation of it.

Active and Passive Vocabularies 

It has been determined that there are differences of learning levels
to be achieved even within school grades. For speaking or writing an
active knowledge is required: in spelling, which requires recalling and
writing the word in the right combinations of letters, and in talking, which
requires being able to recall and pronounce the word in an understandable
manner with due attention to accent, stress, and intonation. On the other
hand, for reading and listening what is required is less complex, namely
the visual or aural recognition of the word and determination of its meaning
in the context of the other words with which it is found. This is not
as simple as it appears, since it also involves recognition of typical
sentence patterns of the language involved, but is nevertheless a lesser
skill than having to recall and use the words and structures involved as
one does in speaking or writing. Active vocabularies, i.e., those used
for speaking or writing, are referred to by various authorities as
"production or expression" vocabularies. Passive vocabularies, i.e., those used
for listening or reading, are referred to as "recognition, reception, or
comprehension" vocabularies.

Items Counted 

There are various definitions of the lexical unit to be counted, but
in the end, the use of the dictionary word unit appears most efficient,
even though in the learning process prefixes, suffixes, inflections,
derivatives, and idiomatic expressions must be considered as well as shifts
from basic meaning. Vakar defines a word as "every combination of letters
with blank spaces on both sides" (1966, Vol. I, page 11).
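
Vakar's definition translates almost directly into a tokenizing rule.
The sketch below is one possible reading of it, not Vakar's own procedure:
stripping edge punctuation, lowercasing, allowing internal hyphens and
apostrophes, and discarding numerals are all assumptions made for the
example, and `vakar_words` is a hypothetical name.

```python
import re

# A run of letters, optionally joined by internal hyphens or apostrophes.
LETTERS_ONLY = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def vakar_words(text):
    """Return the letter combinations bounded by blank spaces in `text`."""
    words = []
    for chunk in text.split():                  # blank spaces on both sides
        candidate = chunk.strip(".,;:!?\"'()[]").lower()
        if candidate and LETTERS_ONLY.fullmatch(candidate):
            words.append(candidate)
    return words


example = 'The well-known word-book, "Basic German" (1964), lists 1,026 idioms.'
print(vakar_words(example))
```

Even this short rule embodies decisions (case, hyphens, numbers) that
change the resulting counts, which is the point taken up in the discussion
of subjective choices in "objective" counts.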

The first word counts tended to be of the printed or written word,
principally the former. The idea was, and still is, that to teach or learn
efficiently, vocabulary must be built up in a natural and effective way
so that the lexical units which are used most are learned first along with
the grammar required to facilitate understanding of the basic structures
within which the lexical units appear.

Subjective Discussion in "Objective" Word Counts

The early counts of the printed word were the so-called direct or
objective frequency counts resulting in Wordbooks, Word Frequency Books,
or Frequency Dictionaries. These counts, although called objective because
words were counted and frequencies tabulated, involved numerous subjective
decisions which actually made them hybrid objective-subjective counts rather
than purely objective ones. Some of the subjective problems which had to be
resolved were:

1. What is the purpose of the count?

2. What is to be counted?

3. What is to be recorded?

4. Are homonyms to be counted as one word or as separate words?

5. Are meanings of words other than homonyms, i.e., the semantic
subfrequencies, to be considered?

6. How many items are to be counted?

7. How wide a range of categories of material and sources within
categories has to be sampled to satisfy the purpose of the
count?

8. What should be the time frame of the materials used as sources;
i.e., only current materials, or those extending back to a
specific date in the past?

9. How will the sampling be done?

10. How many words will be included in each sample?

11. Is frequency to be the only criterion for assigning a rank order
of importance to the words determined to be in frequent use?

12. How many and what kinds of items are to be included in the final
list culled from among all the items collected?

13. In what formats are the results to be displayed?

Purpose



The first question, the consideration of purpose, sets the stage for
answering all the others. However, not all of them follow automatically
from the purpose, since alternative approaches are open. Purposes have been
discussed above and many of the more common ones are listed in Appendix 1.

What is to be counted and recorded? This question generally resolves
itself into two phases. Initially, everything in the sample is recorded.
In the better counts, each word is preserved in context for use in determining
variations of meaning. For the purposes of word counting, however, a decision
has to be made as to whether to record oral markers, punctuation, exclamations,
false starts and repetitions in oral language, obscene words, coined words,
ungrammatical utterances, dialectical items, proper names, place names,
and numbers. The tendency has been to omit markers, punctuation, place and
proper names, numbers, false starts and repetitions in oral language, and
to drop vulgar or obscene words. On the other hand, the tendency has
been to convert coined words, ungrammatical utterances, and dialectical
items to standard English equivalents. Again, all depends on the purpose.
If the purpose is purely to study what is being written or said, then
everything can be included, since language is what is being said and written
and not what someone thinks it should be. On the other hand, if the final
object is the teaching of children, there is little sense in preserving immoral
or illiterate expressions for their edification. There has been a long
standing tendency to omit place and proper names, numbers, and perhaps
days of the week from word lists early in the compilation. The basis
has been that general word lists are desired, and that these are either
specific (as proper and place names) or so common (as numbers) that they
do not belong in the word list. However, at least one modern vocabulary
based on a word count, "Fundamental French, 1st Level", includes numbers,
days of the week, months of the year, seasons, and measurements as special
appendices to the vocabulary on the basis that everyone must use them
at some time or other, even if frequency of usage in general conversation
or writing is low.

Homonyms and Headwords 

There has been a general tendency, until recently, to record homonyms
as one word rather than to separate them on the basis of meaning. While
this may have been desirable in the earlier counts designed to develop
word lists for the teaching of stenography, it is definitely contraindicated
in a study of language as such. There has also been a tendency to suppress
forms and to show in the word lists only the headword with a frequency equal
to the total frequencies of all its inflections (declensions and
conjugations), derivatives, and unhyphenated compounds. This system has the
advantage of keeping the list of basic words short while indicating the
frequency with which the basic word appears in any of its forms. However, a
better practice is to show the headword and then to indent under it its
derivatives and unhyphenated compounds in a word family group, with the total
family frequency listed for the headword, and individual frequencies listed
for derivatives. A similar problem arises with respect to singulars and
plurals. The singular is usually the headword, but often the frequency of the
plural is included only in the headword and the plural is never shown in its
own right. If we are dealing with words only as simple concepts or ideas,
this may have some rationale, but if we are also interested in how the
word is used, i.e., whether only or mostly in the singular or plural,
subsuming the frequencies under the headword tells the student or teacher
only the gross usage of the concept, not the form or forms in which it
appears. It would appear best to use the form with the greatest frequency
as headword and to indent the plural (or singular) under it, indicating
its part of the total frequency.
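
The word family layout recommended above might be produced along the following
lines. This is a minimal sketch under stated assumptions: the function
family_listing and the headword_of lookup are hypothetical, and the sample
frequencies in the usage note are invented. Every form, including the headword
itself, is indented with its own frequency so that no individual count is
lost.

```python
from collections import defaultdict


def family_listing(form_counts, headword_of):
    """Group form frequencies into word families keyed by headword.

    form_counts: dict mapping each word form to its frequency.
    headword_of: dict mapping a form to its headword (hypothetical lookup);
                 forms missing from it are treated as their own headword.
    """
    families = defaultdict(dict)
    for form, count in form_counts.items():
        families[headword_of.get(form, form)][form] = count

    lines = []
    for head in sorted(families):
        members = families[head]
        # The headword line carries the total family frequency.
        lines.append(f"{head} {sum(members.values())}")
        # Members are indented, most frequent first.
        for form, count in sorted(members.items(),
                                  key=lambda kv: (-kv[1], kv[0])):
            lines.append(f"    {form} {count}")
    return lines
```

For example, with invented counts {"book": 40, "books": 25, "bookish": 2} and
a lookup that assigns "books" and "bookish" to "book", the listing shows
"book 67" followed by the three indented forms with 40, 25, and 2.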
A similar, but more important, problem arises with respect to meaning.
The whole object of language is to convey meaning. To divorce meaning
from frequency in lists intended for use in teaching language is senseless.
Even Thorndike, with the aid of Lorge, was finally convinced of this in the
1930's, yet we still find frequency lists coming out with the important
element of meaning and semantic frequency omitted. Admittedly, the addition
of meaning complicates and lengthens a word list, but it is essentially a
part of the word family group of headword, derivatives, unhyphenated
compounds, and perhaps inflections. If a word has two or more different
meanings which are difficult for the beginner to infer from each other, merely
listing the word and its frequency does not help very much, particularly in
speaking and writing. There should be sublistings indicating the contributing
frequency or percentage of total frequency of each of the important meanings
of a word. Determining what is important calls for another subjective
decision, but, in general, meanings contributing 10 percent or more of the
total word frequency should be included. The teacher then has the option
of grading his or her materials by teaching initially only meanings amounting
to, for example, 75 percent of the total frequency of the word. Without
an indication of semantic frequency, the teacher is left to his or her
own experience to determine what meanings of the word should be taught
and in what order.
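
The 10 percent and 75 percent rules of thumb above can be illustrated with a
short Python sketch. The function meanings_to_teach, its parameter names, and
any sense labels or counts passed to it are assumptions made for the example;
the two thresholds are the ones suggested in the text.

```python
def meanings_to_teach(sense_counts, min_share=0.10, target_coverage=0.75):
    """Select meanings to list and meanings to teach first.

    sense_counts: dict mapping a sense label to its frequency.
    Returns (listed, taught): the senses whose share of the word's total
    frequency is at least min_share, and the most frequent of those senses
    needed for their combined share to reach target_coverage.
    """
    total = sum(sense_counts.values())
    ranked = sorted(sense_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    listed = [(sense, count / total) for sense, count in ranked
              if count / total >= min_share]

    taught, covered = [], 0.0
    for sense, share in listed:
        if covered >= target_coverage:
            break
        taught.append(sense)
        covered += share
    return listed, taught
```

If no combination of listed senses reaches the target, every listed sense is
returned for teaching, which leaves the final decision with the teacher.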
Closely related to semantic analysis from a teaching point of
view is grammatical analysis. A good count should also show the frequency
(or percentage) of total use which each grammatical use of the word
contributes, since some words function as two or more parts of speech and it
is important to a teacher to know whether the word is most used as an
adjective, noun, or other part of speech.

The problem of whether to list inflections is more difficult.
Most languages conjugate their verbs and some decline their nouns and
adjectives. Most also compare adjectives and adverbs. A list which
classifies all inflections could be very cumbersome, although instructive.
At least for a language which does not decline its nouns except for plurals
and does not assign a gender to most nouns, the problem is largely one
of the advisability of recording and listing the frequency of the conjugations
of its verbs. Certainly such voluminous material ought not to be in the
lists proper, but putting it in appendices would be appropriate as an
aid to the teacher in determining the grading and order of teaching (if
worth teaching at all) of the several tenses of verbs, and within tenses
the "persons" which are important. It appears that such a listing of
verbal conjugations would prove an important economy measure from both
the teacher's and the student's point of view. If a verbal form is to
be used in writing or speaking only once in a million times, there is
little use in teaching or learning it for either active or passive use
except for those who are to become experts in the language, i.e., translators,
interpreters, or teachers. Without such a list, the chances are that time
and minds will be occupied with much excess linguistic baggage to the
exclusion of much more important matters.

Quantity to be Counted 

On the problem of how many words should be counted, there is
empirical evidence ranging from Vakar's Spoken Russian Word Count (1966), with
a corpus of 10,000 randomly selected running words from 93 sources of one
category, through Eldridge's newspaper count of 1911 with 40,000 running
words (Bongers, 1947, page 33), to Thorndike and Lorge's "Teachers' Word
Book of 30,000 Words" (1944), which was based on a combined total of about
23,500,000 running words. Mackey argues that statistically, the greater
the number of items counted, the greater the reliability of the counts
(1967, page 179). Bongers has repeatedly stated that counts of less than
one million running words are of little value (1947, page 240), and Kell
(1965) says that the corpus should contain at least ten million running
words. Yet many apparently excellent recent counts have far fewer than
one million words, e.g., Fundamental French (1st Level) (602,000 total
running words with only 312,315 spoken words) (French Ministry of National
Education, 1959). Vakar, in defending his small 10,000 word corpus, stated
that it was derived from a population of more than one million running words
and that "properly conducted random or sequential sampling makes larger
word counts wasteful, for after all, the commonest words must be common
enough to recur in any text of reasonable length" (1966, Vol. I, page
10).
On the theory that numbers improve reliability, many researchers,
after making their own studies, have added in the results of prior research
to supplement or broaden their own. This, in effect, increases both the
running words and the number of sources of the composite study.
Statistically, this type of additive research should have increased the
validity of the final results. However, it must be remembered that the quality
of the final study can hardly be greater than the average quality of all
inputs, in spite of the greater number of words and sources.

Categories and Sources 

With respect to range, the question of categories (Mackey uses the
term "Registers") of material as well as the selection of sources within
categories arises. Special technical counts can be restricted to one
category, but a count designed to yield a general vocabulary, especially
an active one, must sample a wide range of categories and sources within
those categories. Choice of colloquial as opposed to literary style,
differences of author style, differences in dialect, and differences of
period in which the source was written all affect the occurrence and usage
of words. For a good general vocabulary, a wide and current variety of
oral as well as printed and written material must be included with a view
to deriving words for both active (productive, for writing/speaking) and
passive (recognition and reception, for listening/reading) vocabularies.






The Spanish Vocabulary Count of the University of Puerto Rico
(Rodriguez Bou, 1952) also included a category of subjectively selected
printed sources based on supplemental texts used in the school systems.
This is an indication of the techniques which researchers use in order
to ensure that their final vocabulary is representative for the uses to
which they expect it will be put; in other words, to ensure that the range
of their study is adequate for its purpose.

Time Period 

The time period in which a source was written will also affect
the vocabulary derived from it. Frequency of usage of individual words
and their meanings clearly changes over time. Here again, the purpose
is important in determining range. Josselyn in his word count of printed
Russian (1953) could well go back to the mid-1800's since he was interested
in a vocabulary to assist readers of literary Russian. However, anyone
interested in colloquial oral Russian would use recent sources, as Vakar
did in taking samples from 200 acts in 93 plays published in 1957 or after.
Berger in his studies has found evidence that conversational English may
vary considerably over relatively short spans of time and space (1967,
page 20).



Source selection was the object of some of the strongest criticism of
Thorndike's "Teachers' Word Books," since in its 1921 edition it leaned so
heavily (75 percent of four million words) on the Bible and literary works.
As successive editions appeared and additional material was added, this
situation improved as the biblical/literary sources became diluted in the huge
corpus. Nevertheless, that part of the original corpus (about 25 percent)
which was current in 1921 was already at least 23 years old when the 1944
edition appeared. This combination of the already old and that which
became old before the final edition appeared resulted in making at least 15
to 20 percent (four to five million running words) of the final corpus
out of date even in 1944.

To overcome the problem of the effects of selection of individual 
sources on word frequencies, as large a number of categories with as great 
a variety of sources within categories as possible is desirable within the 
bounds of manageability and diminishing returns. 

Sampling and Sample Size

The number of ways in which samples can be collected from sources is
almost infinite. To reflect the structure of the language and word meaning,
however, if the material is taken from printed or written material, each
sample should seldom be less than sentence length, and paragraph length might
even be better, regardless of whether the material is taken sequentially,
randomly, or according to some other predetermined pattern until the
required number of running words has been taken from each source.
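
As an illustration of sentence-length random sampling up to a quota of running
words, consider the following Python sketch. It is not the procedure of any
particular count: the function sample_sentences, the naive sentence splitter,
and the fixed random seed are assumptions chosen to keep the example short and
repeatable.

```python
import random
import re


def sample_sentences(source_text, quota, seed=0):
    """Draw whole sentences at random until `quota` running words are taken.

    Returns the sampled sentences and the number of running words they
    contain. The final sentence is kept whole, so the count may exceed quota.
    """
    # Naive split after ., ! or ? followed by whitespace (an assumption;
    # abbreviations and quotations would need more careful handling).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", source_text.strip())
                 if s]
    rng = random.Random(seed)
    rng.shuffle(sentences)

    sample, words = [], 0
    for sentence in sentences:
        if words >= quota:
            break
        sample.append(sentence)
        words += len(sentence.split())
    return sample, words
```

Sequential sampling would simply omit the shuffle; paragraph-length sampling
would split on blank lines instead of sentence boundaries.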
Sampling of oral language can be done in several ways. In any
method, spontaneity is desired. Cues in the form of words or subject
areas are used to evoke the flow of words. The number of words taken
from each informant depends on the size of the corpus desired and the
number of informants. It is important that categories of subjects to
be talked about and stimulus words or pictures used to elicit responses
of either connected discourse or individual words be selected in advance
in order to ensure thorough coverage of the several aspects of daily life.
Since each person speaks at his own rate, it is difficult to determine
how long the informant has to talk to provide his quota of words. However,
the common research practice has been to record discourse by tape recorder
for periods running from 3 to 12 minutes.

If discourse on a subject area is desired, the informant is asked
to talk, for example, about his job, his family, his home, the furniture
in his home, his hobbies, or sports, as desired. If the speaker slows
down because he is running out of subject matter or ideas, he may be prompted
with leading questions so that he will touch on aspects of the subject
he has overlooked.



Another method of evoking spontaneous speech has been to use the
Murray Thematic Apperception Test of 1943 or a similar device. Informants
are asked to talk about the pictures which are the basis of the test,
with care taken to match the picture with the sex of the informant in order
to obtain subject-related words from individuals most likely to be well
acquainted with the subject.
Another method is that of association, either free or controlled.
In free association, the informant is asked to write down everything that
comes into his mind during a specified time period, e.g., five minutes.
In controlled association, the informant is given a stimulus word and
asked to write down words, e.g., all the nouns, verbs, and adjectives,
which that stimulus suggests to him in a specified period of time. Free
association has been used at least since 1936, when Buckingham and Dolch
published "A Combined Word List".

Still another method, useful with school age children for written 
counts, Is to examine written compositions at various grade levels on 
various subjects. 



Variations and combinations of these techniques have been used widely
in the past 20 years in English, Spanish, French, and German word counts,
and have yielded, particularly for oral word counts, results as useful
as those extensive sampling has yielded for the written or printed counts
of the past.

The stimulus words related to subject area, for example, have been
quite useful in discovering the so-called concrete, topical, or utility
words. These are commonly nouns or adjectives. They are often
single-meaning words which relate only to specific things and are not likely
to occur in general frequency counts unless the count has a very large corpus
which happens to contain a contribution from a source related to the subject
in which the concrete word is likely to appear. Many of these words in
a general count would have a frequency of only one in several million and
a very restricted range, yet they are absolutely essential if one wants to
talk about the specific subject to which they are related. Many of the
frequency counts were meant to produce vocabulary for teaching purposes,
but failed to produce words such as chalk or chalk (black) board which are
intimately related to the classroom. Until recently, the only way to
obtain such essential words was to compile a very large corpus or to use
subject or interest area conversational topics, controlled associative
techniques, or both. Recently, however, Richards (1970) has recommended
a system called "Word Familiarity," which is a subjective rating of a
list of words according to the relative frequency with which informants
expect to encounter the words.

Relative Importance of Words 

Once the different words (types) have been selected from the running
words (tokens), and the frequency count completed, the question arises
as to the relative importance of the words which have been isolated. On
this depends the order of consideration and presentation if one is preparing
a textbook for teaching. Originally, the raw frequency was taken as the
indication of the relative importance of the word: the higher the frequency,
the more important the word. This appeared to be true for teachers and
students of stenography and spelling, for whom many of the early counts
were made. However, the criterion of frequency by itself became suspect
when studies were made of the frequency of the words in each of the sources
which contributed to the total frequency. It was discovered that some
words were fairly evenly distributed among sources and thus truly general
and useful regardless of subject matter. Such words included, but were
not restricted to, the so-called structurals or functors. A problem in
relative word importance became apparent, however, when it was discovered
that a word with a high total frequency might be derived from a single
source or small number of sources out of all the sources which contributed
to the total corpus. Unevenness of frequency distribution across sources
is generally an indication that the word is somehow specific to one or
a few subjects and is not encountered generally. (Bongers has labeled such
types "environmental" words.) The problem of relative importance or rank
order of words on vocabulary lists was thus broadened to the question
of what comes first: total frequency or range (number of sources in which
the word appears, with consideration of the frequency in each source).
Some early researchers opted for frequency, some for range, and many others
adopted various objective and subjective formulas for combining the two
in determining word importance. Most continued to use frequency as basic
but used a method such as dropping words that occurred only in one or
two sources as not being representative enough for consideration, at least
in basic or beginning vocabularies. Bongers has held that trying to correlate
frequency and range is an impossible task, since after about 1200 words
which are common to most subjects, the words obtained in any word count
are so highly dependent on selection of sources that no meaningful permanent
relationship between frequency and range can be derived.
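
The distinction between total frequency and range, and the common practice of
dropping words found in only one or two sources, can be sketched as follows.
The function names frequency_and_range and general_words and the min_range
parameter are assumptions for illustration; range is taken here simply as the
number of sources in which a word occurs at least once.

```python
from collections import Counter


def frequency_and_range(sources):
    """Compute total frequency and range for every word.

    sources: dict mapping a source name to its list of tokens.
    Returns {word: (total_frequency, range)}.
    """
    freq, rng = Counter(), Counter()
    for tokens in sources.values():
        counts = Counter(tokens)
        freq.update(counts)
        rng.update(counts.keys())  # each source adds at most 1 to the range
    return {word: (freq[word], rng[word]) for word in freq}


def general_words(stats, min_range=3):
    """Rank words by frequency, keeping only those in min_range+ sources."""
    kept = [(word, f, r) for word, (f, r) in stats.items() if r >= min_range]
    return sorted(kept, key=lambda item: (-item[1], item[0]))
```

With min_range=3, a word of high total frequency drawn from a single source
(an "environmental" word in Bongers' sense) is excluded even though it would
outrank evenly distributed words on raw frequency alone.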
Bongers concluded that the best method to date of dealing with
the frequency/range problem is that developed by de la Court while working
in Indonesia. De la Court's system was briefly this: His total corpus
was one million words. When he had counted 500,000 words, he totalled the
frequencies of each individual word. He similarly totalled the frequencies
of each individual word in the second half of the corpus. Then he compared
the frequencies for each word in each half of the count. If they were
far out of balance, e.g., had a ratio of 1/10 or less, he dropped the
word as being too specific for a general list. He dropped 26 words for
this reason alone. Vander Beke in his French Word List dropped any word
not appearing in at least half of his sources, thus eliminating many concrete
nouns.
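
A split-half check in the spirit of de la Court's procedure might look like
the Python sketch below. It is an interpretation rather than a reconstruction
of his method: the function split_half_filter and the max_imbalance parameter
are assumptions, and the rule that a word absent from one half is always
dropped follows from reading "a ratio of 1/10 or less" literally.

```python
from collections import Counter


def split_half_filter(tokens, max_imbalance=10):
    """Split the corpus into two halves and separate unbalanced words.

    A word is dropped when its smaller half-frequency is at most
    1/max_imbalance of its larger half-frequency.
    Returns (kept, dropped), each mapping word -> (first_half, second_half).
    """
    mid = len(tokens) // 2
    first, second = Counter(tokens[:mid]), Counter(tokens[mid:])
    kept, dropped = {}, {}
    for word in set(first) | set(second):
        a, b = first[word], second[word]
        low, high = min(a, b), max(a, b)
        # Integer comparison avoids division and floating-point rounding.
        target = dropped if low * max_imbalance <= high else kept
        target[word] = (a, b)
    return kept, dropped
```

Because the comparison is made between halves of one corpus, the outcome
depends on the order in which sources were counted; the same words could fare
differently if the sources were arranged otherwise.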



Ernest Horn in his Basic Writing Vocabulary (1926) argued that there
are two measures of importance in judging subject matter for inclusion
in lists: frequency, and value attached to each occasion when material
is needed or used. The value attached to each occasion, according to Horn,
was an indication of two types of range: geographical and across writing
samples. In effect, Horn obtained range estimates based on the number of
types of correspondence in which a word was found, the number of writers
who used it, and also where the writers lived. Bongers did not disagree
on distribution as it relates to use in the sources employed by Horn, but
he believed that Horn's worry about geographical distribution was a sampling
problem which could have been handled by judicious selection of sources
and an increase in their number.
Faucett and Maki (1932), in "A Study of English Word Values
Statistically Determined from the Latest Extensive Word Counts," tried a
different system in an effort to combine the range and frequency ratings in
the Thorndike and Horn lists. They placed all the words on a scale from 1
(the most valuable words) to 120 (the least valuable), on the basis that
words of value 1 had the widest range and greatest frequency and 120 the
least of both. Intermediate groupings were defined as: 1 to 9: Indispensable;
10 to 34: Essential; 35 to 80: Useful; and 81 to 120: Special. One problem
that they encountered arose from the purpose of the Horn list. Horn
dropped all words spelled with three letters or less, thus suppressing
a large group of short words. This obvious attempt to deal with the
difficult problem of resolving frequency and range failed because Thorndike
and Horn did not use the same type of frequency ratings and did not even
agree on their definitions of a word.

More recent studies indicate that relative word importance, at
least for language teaching purposes, depends on factors other than
frequency and range. These factors include availability, coverage, and
learnability. We have already discussed "availability," otherwise known as
"utility," when we discussed the concrete, subject-oriented nouns and
adjectives. However, verbs too can be situational or specific. For example,
lists made of the ten most frequent and the ten most available French
verbs revealed only one word common to both lists, and that was "aller"
(Mackey, 1967), which may be glossed as "to go" but also "to be going
to" or "about to do" something. It was fifth on the most frequent list
and seventh on the most available list. The only reason that it was on
both lists was that as a "content" verb (to go) it was used frequently,
and as a structural (auxiliary) verb it is popular as a present progressive
auxiliary indicating an action to be taken in the near future. In English,
"go", "got", and "have" have the same duality of function.

With respect to "coverage", the importance of a word depends on its
ability to replace the greatest number of others with which it is wholly or
partially synonymous. In a basic learning vocabulary, it is preferable to
include one word that can be used to serve for six others. In this way,
the learning effort is reduced to about 10 to 15 percent of what it
would be if all seven words had to be learned. Coverage can be broken
down into four subdivisions: Inclusion - a general word which includes
the meaning of several more specific ones is to be preferred to a word
with only one or two meanings; Extension - a word with many full or partial
synonyms is preferred to one with few; Combination - a simple word which
can be combined with others in compounds which replace other individual
words is preferred to a word which does not combine often in general usage;
and Definition - a word that is most useful in defining, and therefore
substituting for, others is to be preferred to one which is of little use
in explaining other words. Michael West explores this property of words
in his Definition Vocabulary (1955) and Ogden in the development of his
Basic English (Graham, 1968).







J. G. Savard in his book entitled La Valence Lexicale (1970)
discussed the use of word coverage as an alternative to frequency in
determining the relative value of words. He found that the correlation
between word coverage and frequency is weak. He believed that they are
two very different principles related to word value, and that word
coverage was no less valid a measure than frequency. He recommended that
studies be continued to try to find correlations between frequency, range
(distribution), availability (utility), and word coverage with its four
constituents as listed above. He believed that word coverage represents
a new variable which should be considered in determining the assistance a
select vocabulary can render to language learning. He argued that it,
or something like it, is needed to supplement frequency, range, and
availability in determining the relative value of words constituting limited
vocabularies. Word coverage, in his opinion, is a measure of verbal
economy and will have useful side applications in the development of
dictionaries, glossaries, and thesauruses.



A word may also be important from the point of view of "learnability", which
is an awkward way of saying it is easier to learn than other words.
Logically, "learnability" may be considered to be a function of "similarity",
"clarity", "brevity", "regularity", and "learning load" or "burden".
Similarity generally occurs because words are cognates in the two languages
concerned; they are generally orthographically and referentially similar.
This is not always true, however, because words may be more inclusive
in one language than another and their frequency of usage will likewise
be different. Clarity is usually found in concrete as opposed to abstract
words. Brevity is a function of being spelled with a few letters and
being easily pronounced. Regularity is a function of following grammatical
rules, such as regular plurals or regular conjugations. Regular words
are preferable to irregular ones, which require more learning effort.
Learning load is inherent in the preceding four aspects of learnability;
words selected on the four preceding criteria will normally be easier
to learn. Unfortunately, words which are learned easily may not be the
most useful.



Swenson and West in their study "On Counting of New Words" (1934)
included "A Set of Rating Scales" which are comprehensive in their
enumeration of the gradations of difficulty of learning idioms, cognates,
compounds, spelling variations, and semantic shifts of words. Because of
their special interest, these scales are reproduced in full in the following
displays.

Ernest Horn in his Basic Writing Vocabulary (1926) also considered
spelling difficulty as an input to word importance, but his purpose was
quite different. Although the writers of basic vocabularies are looking
for balanced word lists, they want their words to be as simple as possible
and still permit expression at an adequate level of language proficiency.
This motivation leads them to delete a word that is in some way difficult
if a simpler substitute is available. Horn, on the other hand, in trying
to help teachers with their job of getting pupils to learn to spell well,
used the greater degree of difficulty as a rationale for including a word
in his list.
Scale I. A Rating Scale for Changes of Meaning

 0  A word already learned, and now used in exactly the same meaning.
 2  A change of meaning just perceptible to the teacher; query worth
    pointing out?
 4  A change of meaning so slight that it would not be noticed in
    reading - but it might be pointed out in speech.
 6  A change of meaning which would be noticed in reading and would
    cause a moment's hesitation.
 8  A change of meaning which would be noticed in reading and may cause
    considerable hesitation, but the meaning will be inferred by all the
    pupils eventually.
10  The average child might just, or just not, be able to guess the
    meaning in reading. (MID-POINT)
12  The new meaning would probably not be guessed in reading but is
    easily grasped when explained, and is easy to explain.
14  The new meaning could not possibly be guessed in reading but can be
    explained with medium difficulty.
16  It is difficult to show the connection between the old and the new
    meaning; the connection is barely perceptible.
18  Almost a homonym. The old meaning can barely be twisted into the
    new one.
20  An entirely new word of average difficulty which can readily be
    translated into the mother-tongue by one (or two) equivalent words.
21  New words which are less readily translatable.
30  An untranslatable word; it needs a lecture to explain it.
EXAMPLES OF RATINGS ON SCALE I

 1. The discovery of America: the discovery of X Rays. The middle of the
    room: the middle of the night.
 2. A loud cry: a loud voice. An old man: an old building.
 3. One minute to six: wait a minute. High up the hill: high up in his
    profession.
 4. A bad boy: a bad egg. Hollow (adj): in the hollow of a tree.
 5. The sun is in the sky: standing in the sun. Trees growing about the
    house: there were no people about in the streets.
 6. A debt of $1000: I am in debt to him for his help. To fight against
    the enemy: I am against any change in the law.
 7. Peter loves Jane: I love sausages. He has a weak heart: a kind heart.
 8. Of equal size: my equals and my betters. Neck of a man: neck of a
    bottle.
 9. To touch with fingers: leaves touched with gold. The eye: to eye.
10. Form (= shape): form of proceedings.
11. Hollow: near the wood there is a beautiful hollow. Match (marriage):
    to match colours.
12. Hand: he writes a good hand.
13. After hearing what he said: her hearing is bad. They had an argument:
    that is a strong argument for.
14. To touch: touching.
15. A room: room. Air: (= manner)
16. Arch: archer.
17. If he does I shall . . go and see if she is ready.
18. A meal: meal.
19. Arm (part of body): arms.
20. A match (light): match (marriage).
Scale II. A Rating Scale for Idioms

 0  The idiom offers no trace of difficulty (and is exactly the same as
    that of the mother-tongue). Would not be noticed as an idiom.
 2  A very obvious and self-explanatory idiom (almost the same as that
    of the mother-tongue). Might not be noticed as an idiom.
 4  The meaning is very clear, but not quite obvious. Would probably be
    noticed as an idiom.
 6  Would be noticed as an idiom, and would cause a moment's hesitation.
 8  Would cause considerable hesitation, but the meaning would
    eventually be guessed by all.
10  The meaning of the idiom might, or might not, be guessed by the
    average pupil. (MID-POINT)
12  The meaning of the idiom would probably not be guessed by the
    average pupil, but is very easily explained.
14  The meaning could not possibly be guessed by any pupil, but can be
    explained with medium difficulty.
16  The meaning is just perceptible in the words when they are
    explained, but is difficult to explain.
18  The words can just barely be twisted into the meaning.
20  The idiom is quite unexplainable: the whole idiom has to be taught
    as one word.
21 - 30  Idioms in which the pupil is especially liable to go wrong, e.g.
    such as mean something different if translated literally into the
    mother-tongue.



EXAMPLES OF RATINGS ON SCALE II

    Put to death before the eyes of his friends. War came to an end.
    That's new to me. Nothing on earth would . . .

    A button has come off my coat. The building was in flames. Much in
    request as a singer. Work on hand.

    Outside the field of his interest. Give me a hand with this box. To
    go there on foot. Between now and then. Keep to the rules.

    Can have anything in reason. To come into line. I can't put my hand
    on the paper I want. Worked by electricity.

12. Go wrong. I sent for the doctor.
13. He turned up his trousers. Six years old.
14. He came in person. A run on the bank.
15. Go on singing. I look forward to the party.
16. Far nicer. To make money (by selling books).
17. To put up a good fight. Set out.
18. In good order. In order to. He let me down.
19. Said it with his tongue in his cheek.
20. So long! Egged on.
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Scale IIL A Katiiii;t Senlu for Co^^iuitcs 

(Weight for xptUing^pfonOHncitUion %$ to he added io eoinates^ 
when nmtinry. Wv^ds cannot A* fated as ^oinutes unless the 
native wo^d is kno^on to the pupit). 



^ 0 Sucli perfect ideiuity of form and meaning that the word is not 

noticed by the pupil as being nnw. 



■h 

o 



V 



2 There is a just pcrcpptiblc difrcrence of (oriii and/or meaning 
— but the sense is very obvious, 



4 A fairly obviuiis rdatitinsliip; the change of form ind/or 
iiicanin(; will be readily nndcrstoofi. 



6 Such difTcrcnce of form and/or meaning as will cause a 
niomeni's hesiiaiion, 



8 Such (lifTcrencc of form and/or meaning as may cause con« 
sidcrahle hesitation, but the cognate will eveiilually be 
identified and imcrpreted by all. 



10 A coj-natc v\hich ju«t might— or might not - be ii^cntified j^m, 
and interpreted by the average pupil. POINT 



11 



rv. 

"5. 



11 



13 



15 



12 The cognate would probably not be identified or interpreted 
by the average pupil, but will be very readily grasped when 
pointed out. 



14 The cognaic could not f)osHibly be idenrificd or interpreted 
by any pupil, but the relationship of the foreign to the nat*ve 
word can be explained uith medium difPculty. 



10 The relationship of the foreign .'*nd native word is only just 
perceptible and is very dilfKult to explain, but the cognate is 
probably helpful. 

EXAMPLES OF RATINGS ON SCALE III

 0. Courage - Courage
 1. Émotion - Emotion
 2. Un toast - Toast
 3. Flanc (d'une montagne) - Flank
 4. À bord - Aboard
 5. Compter - To count
 6. Parent - Parent
 7. Bravoure - Bravery
 8. Se moquer de - Mock at
 9. Cave - Cave
10. Vous avez raison - Reasonable
11. Partie - Party
12. Se dresser - Right dress!
13. Parcelle - Parcel
14. Rude - Rude
15. Trouble - Trouble
16. Spirituel - Spiritual
17. Pavillon - Pavilion
18. Figure - Figure
19. Galanterie - Gallantry



20. Adresse - Address (on a letter)

Scale IV. A Rating Scale for Compounds of Known Elements

N.B. The Prefixes and Suffixes are scored as new words on their
first occurrence; this scale refers to subsequent compounds only.

0   Addition of an absolutely invariable P or S which never
    causes any change of form or meaning.

2   Addition of a regular P or S with such change of form or
    meaning as would hardly be noticed in reading.

4   Addition of a P or S with such slight change of form or meaning
    as will probably be noticed, but it will cause no difficulty in
    reading (but must be pointed out for speech).

6   Addition of a P or S with such change of form or meaning as
    will cause a moment's hesitation in reading.

8   The change of form or meaning may cause considerable
    hesitation, but the word will be identified and interpreted by
    all the pupils eventually.

10  The average pupil may just, or just not, be able to
    identify and interpret the compound in reading.

12  The compound would not be identified or interpreted in
    reading, but is easily grasped when explained, and is easy to
    explain.

14  The compound could not possibly be guessed in reading but
    can be explained, not easily, but without great difficulty.

16  The meaning of the P or S and original root are only just
    perceptible in the compound, and the compound is not easy to
    explain.

18  The meaning of the P or S and root can just barely be twisted
    into the meaning of the compound. It is doubtful whether
    analysis is useful.

20  The P or S creates a new meaning bearing no relation either to
    the original meaning or to the P or S. Or the P or S changes
    its meaning so widely as to amount to a new P or S (rated as a
    new word).

30  Compounds which cause special difficulty, e.g. such as tend to
    be misused in the literal sense though the real meaning is far
    different.

EXAMPLES OF RATINGS ON SCALE IV

 1. Childhood. Buttonhole.
 2. Leader. Dangerous.
 3. A prefix. Tidiness.
 4. Imprison. Boundless.
 5. Transplant.
 6. Nonsense.
 7. Terrify. Misadventure.
 8. Wasteful.
 9. Irrecoverable.
10. Money-order. Knot 'flowers'.
11. Enlist.
12. Underling.
13. A falsehood.
14. A two-seater. Moreover.
15. Transact.
16. Profiteer.
17. Mislay.
18. Homely.
19. Anchorage. Engender.

20. Well-off.

Scale V. A Rating Scale for Spelling-Pronunciation Discrepancy

This scale is not intended to assess the relative difficulty of the
fundamental sounds of the language, but merely such tendency to
mispronounce or misspell as is induced by a spelling which does not
correspond to the actual sound of the word.

(Note: this rating is to be added as extra weight to the rating of any
new word, any group of letters or sounds not previously
encountered. No word is to be rated twice for spelling-pronunciation,
even if the first appearance of the word was in a totally
different meaning.)

(1) The word is pronounced just as it is spelled, and spelled just as it is pronounced;
no possibility of error.

(2) A slight divergence which may lead to error in pronunciation or in spelling.

(3) A less easy or safe word, but still below the average of those that give any
trouble.

(4) If the word were dictated to an average class, without previous experience of
it, nearly half the pupils might misspell it; or, written on the blackboard, nearly
half the pupils might misread it.

(5) If the word were dictated to an average class, more than half the pupils would
make a mistake.

(6) A definitely troublesome word, but not among the worst.

(7) The notorious trouble-givers.

(8) One of the most often quoted absurdities of English spelling; a word that
almost all foreigners misspell, or, if they spell it right, they mispronounce it
following the spelling. Also words in which the pupil tends to be misled
gravely by the spelling of a word in his mother-tongue.

EXAMPLES

1. This. Time.
2. Blade. Dress.
3. Shock. Roll.
4. Separate. Soup.
5. Science. Soap.
6. Beautiful. Doubt.
7. Scythe. Touch.
8. Cough, tough.

The selection of words and format in a vocabulary list is not
too difficult if all the five criteria (frequency, range, availability,
coverage, and learnability) are in agreement. The problem arises when
the criteria are in conflict on the selection of a word. Any one of the
criteria could be in conflict with all the others just as easily as it
could be in agreement with them. This means that in case of complete
conflict some 10 conflict areas (one for each pair of the five criteria)
have to be resolved. The final resolution depends largely on the uses to
which the list is to be put. For a combined use, such as both a speaking
and reading vocabulary, Fries and Traver (1950) suggest an order of
precedence among criteria as follows: frequency, coverage, range,
availability, and learnability.

The foregoing yardsticks for measuring importance or value by
no means exhaust the list. As intimated above, almost every word counter
has had his own system, either original or a modification of an earlier
system used by another word counter. Perhaps the best list of considerations
other than those discussed above is to be found in the "Interim Report on
Vocabulary Selection" (Carnegie Report) of 1936 (Michael West et al.). It
listed as possible criteria: frequency; structural value (functional types);
universality over wide geographic area (like Horn); applicability to a
wide variety of subject matter (general use words); value for purposes
of defining other words (West's Definition Vocabulary); value for word
building (ability to combine into compounds, discussed above); and stylistic
function (use to express precise meanings and in conversation).

There must be balance between comprehensiveness and usefulness.
Thorndike, with the aid of Lorge in the later stages, worked his original
list of 10,000 words up to 20,000 and then to 30,000 words over a period
of 23 years. The Thorndike Word Book was certainly comprehensive but
in reality not as useful as it appeared because of its antiquated sources.
Thorndike counted and borrowed over 23,000,000 words for his 30,000 word
list. This tremendous corpus size and others like it have led Frumkina
(1964) to propose an application of Zipf's Law which would allow calculation
of the corpus size required to provide a word list which will be
statistically valid down to a pre-selected frequency within a predetermined
margin of error. Using the large corpora assembled by other investigators,
Frumkina attempted to estimate the population values for the frequencies
of individual word types within a specified interval of the sample spaces,
these estimates being based on the observed regularity of the
frequency-rank relationship discovered by Zipf. More recently, Carroll
(1971) has used a log-normal function to estimate the population values of
the word types from the data of the American Heritage Wordbook.
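
The corpus-size question raised above can be illustrated with a short calculation. The sketch below is not Frumkina's procedure, which the text does not spell out. It assumes an idealized Zipf law, p(r) = 1 / (r * H_V), treats the count of a word as approximately Poisson, and asks how many running words are needed for the word at a chosen rank to be counted within a chosen relative error. The function names, the vocabulary size, and the 1.96 multiplier (roughly 95 percent confidence) are all assumptions made for illustration.

```python
import math


def zipf_probability(rank: int, vocab_size: int) -> float:
    """Relative frequency of the word at `rank` under an idealized
    Zipf law, p(r) = 1 / (r * H_V), where H_V is the harmonic number
    of the assumed vocabulary size."""
    harmonic = sum(1.0 / r for r in range(1, vocab_size + 1))
    return 1.0 / (rank * harmonic)


def required_corpus_size(rank: int, vocab_size: int,
                         rel_error: float, z: float = 1.96) -> int:
    """Running words needed so that the observed count of the word at
    `rank` falls within `rel_error` of its expected value.

    Assumes the count is roughly Poisson, so its relative standard
    error is about 1 / sqrt(N * p); solving z / sqrt(N * p) <= rel_error
    for N gives the bound returned here."""
    p = zipf_probability(rank, vocab_size)
    return math.ceil(z ** 2 / (rel_error ** 2 * p))


if __name__ == "__main__":
    # Hypothetical settings: a 30,000-type vocabulary, 20 percent error.
    for rank in (100, 1000, 10000):
        print(rank, required_corpus_size(rank, 30000, 0.20))
```

Under these assumptions the required corpus grows in direct proportion to the rank of the rarest word one wishes to count reliably, which is why lists meant to be trustworthy far down the frequency scale call for corpora of millions of running words.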

In contrast to the comprehensive word lists of the Thorndike
variety are the so-called Basic Vocabularies for the teaching of foreign
languages. These have tended to run from 600 to 3000 words for a good
four-year high school language course. However, these numbers may be
deceptive, depending on how "word" is defined. If the dictionary entry
is taken as the "word", but is deemed to include within it all its
inflectional forms, derivatives, compounds, and the semantic variations of
each, a list labelled as 1000 "words" may actually be effectively a 6000
word list as far as the learning effort required of the student is concerned.
The point is that the short basic vocabulary, although built up of word
families with related forms and meanings, may give a false impression
of the effort required to learn the fundamentals of a foreign language.
Given a reasonable balance between subsuming all forms under the headword
and separate entries for each form, most researchers agree that 3000 to
6000 words provide a good basic vocabulary.

The location of the point of diminishing returns as applied to
frequency has been an object of controversy for several years. Ernest Horn
(1926) in his Basic Writing Vocabulary said, with respect to spelling,
that after the first 1000 words, the addition of each group of 1000 words
in a spelling list adds a very small percentage to the number of running
words that one can spell. For example, the person who knows how to spell
the 4000 commonest words can add only a little more than one percent to
the number of running words he can spell by learning an additional 1000
words, since the new words are those of low frequency of occurrence. West
suggests that a general vocabulary of 7000 words will enable a person
to read most novels, and that for speaking, an individual needs a vocabulary
of about 2800 general words. After reaching the 7000 and 2800 word limits,
the person must start learning specialized vocabularies in his field(s)
of interest. These figures indicate that the 3000 to 5000 word vocabularies
are valuable, even though they may have to be graded into smaller increments
for instructional purposes.
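
Horn's observation about diminishing returns can be checked on any tokenized text by measuring how much of the running text each successive block of the frequency list accounts for. The following is a minimal sketch under stated assumptions: the function name `coverage_by_band` and the toy token list are hypothetical, and the band size would be 1000 for a comparison with the figures quoted above.

```python
from collections import Counter


def coverage_by_band(tokens, band_size=1000):
    """Share of all running words covered by each successive band of
    `band_size` word types, taken in descending order of frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return []
    frequencies = sorted(counts.values(), reverse=True)
    return [
        sum(frequencies[start:start + band_size]) / total
        for start in range(0, len(frequencies), band_size)
    ]


if __name__ == "__main__":
    # Toy data; a real count would use a large corpus and band_size=1000.
    toy = ["the"] * 6 + ["of"] * 3 + ["scythe"]
    print(coverage_by_band(toy, band_size=1))
```

Because the types are taken in descending order of frequency, no band can cover more than the one before it; the point of diminishing returns is where the added share becomes too small to justify the learning effort.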

Another problem is how to determine the words to be included in
a useful reading or speaking vocabulary of the size cited by West, above.
Most authorities, including Ayres, Thorndike, and Horn, have concluded
that objective word counts must be extremely extensive in numbers of running
words and must sample a very wide range of categories and sources to be
accurate beyond the first 500 to 1500 words. Ogden and Palmer believed
that word lists of equivalent length could be compiled subjectively with
the same accuracy. In objective counts, after the first 1500 words, the
lists tend to reflect the subjectively chosen sources and categories,
whether the count is made of oral, printed, or written language.
Nevertheless, Thorndike held that his 1921 list of 10,000 words was good
enough for educational purposes through the first 5000 and that it was
generally useful throughout its full 10,000 words.

Although direct comparisons are impossible because of the different
methods used and definitions applied, most researchers have agreed that
only a very small number of high frequency words are actually used in
the majority of writing and even fewer in speaking. Jones and Wepman
(1966) found that 33 spoken words used by adults accounted for more than
50 percent of all the words they recorded. Herdan reported that for printed
English, the 6? most common words accounted for 50 percent of all words
counted. A study of the Thorndike-Lorge 30,000 word list by Jones and
Wepman concluded that 89 words accounted for 50 percent of all words used
in printed English. (The difference between Herdan's conclusion and that
of Jones and Wepman may be caused by differing approaches, but it might
mean that stylistically, printed English is becoming more laconic, since
Thorndike's material on the whole is rather old.) At higher percentages,
Eldridge (1911) found that 750 words constituted 75 percent of words generally
used in newspaper English. Cook and O'Shea (1914) found that 763 words
constituted 90 percent of words used in correspondence (but 42 percent
of the 763 were highly repetitive function words). Dewey (1923) concluded
that 1000 words constituted 75 percent of words generally used in printed
American English. The Bell Telephone System (1930) calculated that 700
words constituted 95 percent of all telephone conversations. D. B. Johnson
(1972) reported that the most frequent 2000 words in Czech, English and
Russian account for between 75 and 80 percent of words normally used in
print and that 5500 to 6000 words will include over 90 percent of general
reading material. Johnson's studies thus confirm, in general, Horn's
remarks on the point of diminishing returns near the 4000 mark and West's
opinion that 7000 words are required for reading novels comfortably.

The foregoing figures would indicate that a 1000 word vocabulary
is a good starting point, but a more representative list like the Thorndike
10,000 word list is necessary to provide the less frequently used words
required to bring the student up to proficiency in reading and speaking,
as defined in terms of vocabulary size by West and Johnson. Palmer believed
that it is best to have a limited general vocabulary as a base and to
supplement it with lists designed for specific technical, vocational,
and academic fields. Experience with Fundamental French (1959) and Basic
(Spoken) German (1964) appears to support that belief.

Disposition of High Frequency Function Words

A problem that has to be resolved with respect to word lists derived
from frequency counts is what to do with the most frequently occurring
words which invariably appear in the first 500 words of any frequency
count of a given language. These are the so-called grammatical (structural
or relating) words with which we speak or write and which for this reason
appear to be largely independent of subject matter. The content words, i.e.,
what we talk about, are largely nouns and verbs, many of which have very
specialized meanings. They are, therefore, highly situational and, in
general language, have a low frequency of occurrence. As a result of the high
frequency and constant appearance of the structural words, word frequency
counters have as a matter of routine deleted from 50 to 300 of the most
common ones from their frequency counts and have placed them in separate
lists or appendices. These are words which experience has shown will appear
with the highest frequency and, therefore, early in any frequency-based
vocabulary count of the language. They are listed separately in order that
the main lists may concentrate more on the "content" words of the language.
Also separately listed by most counters, but for reason of their specialized
use, are cardinal and ordinal numbers as well as proper names and place
names which are so subject or area related that they warrant no place on a
general word list.
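
The routine described above, setting the most common structural words aside so that the main list concentrates on content words, amounts to partitioning a frequency count against a list of function words. The sketch below assumes a short, purely illustrative list (`FUNCTION_WORDS`) and a hypothetical function name; actual counters removed anywhere from 50 to 300 such words, and their lists differed.

```python
from collections import Counter

# Illustrative only: not the list used by any counter cited in the text.
FUNCTION_WORDS = frozenset(
    {"the", "of", "and", "to", "a", "in", "is", "it", "that", "with"}
)


def split_frequency_list(tokens, function_words=FUNCTION_WORDS):
    """Count word frequencies, then return two lists of (word, count)
    pairs: structural words and content words, each sorted by
    descending frequency and then alphabetically."""
    counts = Counter(token.lower() for token in tokens)

    def ranked(pairs):
        return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))

    structural = ranked(
        (w, c) for w, c in counts.items() if w in function_words
    )
    content = ranked(
        (w, c) for w, c in counts.items() if w not in function_words
    )
    return structural, content


if __name__ == "__main__":
    sample = "The cat sat in the hat and the cat ran".split()
    structural, content = split_frequency_list(sample)
    print("structural:", structural)
    print("content:", content)
```

Cardinal and ordinal numbers, proper names, and place names could be set aside in the same way by supplying further exclusion lists.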



Word Groups or Collocations

Another problem which has to be addressed in frequency counts is
that of what are commonly called "idiomatic expressions". Some special
meanings appear to depend on the combined sense of a more or less fixed
association of words which have come to convey a meaning separate and
distinct from the sum of the meanings of their component words. Palmer
made a study of "idiomatic" expressions and arrived at the conclusion
that the term was actually a misnomer. In the "IRET Second Interim Report
on English Collocations" Palmer reported on his study of the overlapping
fields of vocabulary and syntax. The so-called "idioms" fall into linguistic
groupings which Palmer called "Pliologs" or "something more than words".
Within Pliologs he distinguished among "linguistic formulas" (conversational
expressions, proverbs, aphorisms, and quotations), "syntax patterns" (mainly
grammatical), and "collocations" proper, such as verb-, noun-, adverb-,
and preposition-collocations. However defined, these word groups are an
essential aspect of language proficiency.

In the 1929-30 period, before Palmer published his study, three
"idiom" lists were published under the auspices of the Canadian and
American Committee on Modern Languages or the American Council on Education.
The first was Hauck's on German, which supported B. Q. Morgan's "German
Frequency Book" with 959 idiomatic expressions based on a minimum frequency
of two and a range of one. The second was Keniston's on Spanish with
1293 entries which were checked against Buchanan's "Graded Spanish Word
Book". It placed primary emphasis on range as a criterion for idiom selection,
using three out of a hundred as the cut-off point. The third was
Cheydleur's on French with 1724 entries with a minimum range of three out of
07 sources.
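
The selection rules quoted for these idiom lists combine two measures: frequency (total occurrences) and range (the number of separate sources in which an expression occurs). A minimal sketch of such a filter follows. The function names, the data layout (a mapping from source name to the expressions found in it), and the sample data are assumptions for illustration, not a reconstruction of any compiler's actual procedure.

```python
from collections import Counter


def frequency_and_range(occurrences_by_source):
    """Return (frequency, spread): total occurrences of each expression,
    and the number of distinct sources in which it appears."""
    frequency = Counter()
    spread = Counter()
    for expressions in occurrences_by_source.values():
        frequency.update(expressions)
        spread.update(set(expressions))  # count each source only once
    return frequency, spread


def select_expressions(occurrences_by_source, min_frequency=2, min_range=1):
    """Expressions meeting both cut-offs, in alphabetical order."""
    frequency, spread = frequency_and_range(occurrences_by_source)
    return sorted(
        expression
        for expression in frequency
        if frequency[expression] >= min_frequency
        and spread[expression] >= min_range
    )


if __name__ == "__main__":
    # Hypothetical sample: three sources and the expressions found in each.
    sample = {
        "novel": ["on foot", "on foot", "set out"],
        "letters": ["set out", "so long"],
        "newspaper": ["set out"],
    }
    print(select_expressions(sample, min_frequency=2, min_range=1))
    print(select_expressions(sample, min_frequency=1, min_range=3))
```

A frequency-led rule (such as a minimum frequency of two and a range of one) admits expressions that recur within a single source, while a range-led rule admits only those spread across several sources.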

Sometimes idiom lists were not only checked against or designed to
support a specific word list, as Hauck's and Keniston's above were, but
were derived at the same time as that word list. An instance of the latter
is de la Court's collocations, which were a part of his list called "The
Most Frequent Dutch Words and Collocations". It had a list of 3296 words
and about 2000 collocations as defined by Palmer, above, in his second IRET
Report. The so-called "Linguistic Formulas" (sometimes called Category II
items) such as proverbs were not included since none attained the cut-off
frequency of five.

In German, two more idiom lists are important: Purin's and Pfeffer's,
with Pfeffer's being by far the more important. Purin's "A Standard German
Vocabulary of 2932 Words and 1500 Idioms" (1937) is a secondary list derived
from prior vocabularies and frequency counts. It is of interest
since, like de la Court, Purin recognized that basic concepts and meanings
are often conveyed by idiomatic type expressions and deserve recognition
as a part of vocabulary. It is also of interest since in it the meaning
of the idiomatic entry was frequently illustrated by using the idiom in
a contextual utterance.

Perhaps the best of the current idiom lists is Dr. Pfeffer's "A
Spoken German Idiom List" (1968). It is third in his excellent series
of Spoken German. Pfeffer does not refer to Palmer's studies on idioms,
but he does refer to the Hauck, Keniston, and Cheydleur lists mentioned
above. He defines his idioms as "semantic restrictions of syntactically
collocated parts" in which varying degrees of restriction may occur. The
Pfeffer list was derived with the aid of computers from the research done
to produce his "Basic (Spoken) German Word List" (1964) and "English
Equivalents" (1965). Pfeffer selected 1026 idioms from the oral material of
his Word List with a frequency to range ratio of 3/2 or higher and 99
others which were discovered while developing his topical (utility or
available) words and while rounding out his Word List by empirical additions.
All of the words composing the 1026 idioms are found in his "Basic Word
List". The total of 1125 idioms represents about 05 percent of the restricted
forms and related patterns (idioms) found in spoken German.

Need for Uniformity

It is apparent in reviewing the history of word frequency counts
and related vocabularies or word lists that the methods and techniques
are as varied as the researchers and their purposes. Now that the science
of word counting is evolving rapidly, with the oral count coming of age
as electro-mechanical techniques of recording speech have become available
and both oral and written/printed counts becoming subject to manipulation
by computer, it would appear that we need a new convocation of word counters
similar to those sponsored by the Carnegie Foundation in 1934-35. The
purpose would be to coordinate the efforts of the many researchers by
exchange of information, deciding on definitions, and discussion of the
relative merits of the several methods and techniques being used to arrive
at the many subjective decisions which have to be made. It is precisely
these differing techniques, methods, and subjective decisions that make
much of the research so diverse as to make comparisons impossible without
considerable manipulation.

Evidence that this need for uniformity is, and has been, felt
is found not only in the Carnegie Conferences under the leadership of
West, but also in the observations of others who have been frustrated
in their attempts to grasp the status of developments in linguistics
and language teaching because of the lack of comparability of the efforts
of previous and contemporary researchers in the field. Such lack of
uniformity has occasioned extensive efforts to recast the work of one study
in terms which will make it comparable with that of another. These are
required to avoid invalid comparisons or simple inability to find common
ground. Rolf-Dietrich Keil of Germany has recently addressed this subject
and made a passionate plea for standardization in his "Einheitliche Methoden
in der Lexikometrie" (1965).

Formats for Display of Results

The basic format of most frequency counts has been to list the
selected number of words or other items in order of frequency and then
list them in alphabetical order. An expansion, as range or distribution
began to be considered, was to list the selected words according to frequency
and to add the range in a parallel column, or vice versa depending on
whether frequency or range was considered the more important. An alternative
relative ranking can be obtained by any of the various formulae combining
frequency, range and other value judgements into some numerical index
representing word importance. This composite value is used to determine
the order of listing of words. Total frequency and range are then listed
in parallel columns for each word. A refinement of the above is to add
columns for each word indicating its range and frequency in each of the
categories of material from which the count was compiled. A final refinement
is to add a frequency count of the grammatical uses of each word, i.e.,
how many times it was used as a noun, verb, adjective, or adverb.
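
The display formats described above can be generated mechanically once per-category counts are available. The sketch below, with hypothetical names and toy data, builds one row per word holding its total frequency, its range (the number of categories in which it occurs), and its frequency in each category, and then returns the rows both in frequency order and in alphabetical order. It does not attempt the composite importance indices or the grammatical-use counts also mentioned.

```python
from collections import Counter


def build_listings(tokens_by_category):
    """Return (by_frequency, alphabetical): two orderings of the same
    rows, one row per word with total frequency, range, and the
    frequency within each category of source material."""
    categories = sorted(tokens_by_category)
    per_category = {c: Counter(tokens_by_category[c]) for c in categories}

    words = set()
    for counter in per_category.values():
        words.update(counter)

    rows = []
    for word in words:
        by_category = {c: per_category[c][word] for c in categories}
        rows.append({
            "word": word,
            "frequency": sum(by_category.values()),
            "range": sum(1 for n in by_category.values() if n > 0),
            "by_category": by_category,
        })

    by_frequency = sorted(rows, key=lambda r: (-r["frequency"], r["word"]))
    alphabetical = sorted(rows, key=lambda r: r["word"])
    return by_frequency, alphabetical


if __name__ == "__main__":
    # Toy data: two categories of source material.
    sample = {"fiction": ["the", "cat", "the"], "news": ["the", "bank"]}
    by_frequency, _ = build_listings(sample)
    for row in by_frequency:
        print(row["word"], row["frequency"], row["range"], row["by_category"])
```

Sorting on range first and frequency second, or on a composite index, requires changing only the sort key.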

In the final analysis, the format for display of results depends
on the purpose of the count and the uses to which it is expected it will
be put. Usefulness to the reader is the most important criterion. With
the assistance of computers, the variety of formats of display of material
has increased enormously and there is little reason, from the point of view
of time, not to present the material in its most useful form for one or
several groups of consumers.

Value of Objective Word Counts



There are still some who, like Palmer in the 1930's, believe that
objective counts are useless or, at best, misleading, or that, if they do
produce anything, it is only a passive (reception) reading or listening
vocabulary. One of the more critical articles is W. E. Bull's "Natural
Frequency and Word Counts" (1949). The subtitle "The Fallacy of Frequencies"
is a good indication of the tenor of his article. Bull argues that:

1. There is an inverse relationship between natural frequency of a
grammatical form (such as a noun, verb, article, or adjective) and the
frequency with which each form is used. This is borne out by the high recorded
frequencies of the relatively few grammatical (functional or structural)
words (articles, adjectives, pronouns, prepositions, conjunctions and
relating verbs) which provide structure to the language, and often have
multi-meanings, although they are not content-bearing words. On the other
hand, the real "content" words which convey the meaning of what we talk
about tend to have fewer meanings per word, perhaps only one, and refer
to specific objects and situations.

2. Any word count is statistically valid only for what is included
within it. Keil recommends at least ten million words (1965). Variation in
corpus selection does make a difference in the words discovered. That
fact is reflected in the decreased comparability among frequency lists
after the first 1000-1500 words.

3. Extremely high frequency words are rarely the content-bearing
elements of any communication.

4. Range and frequency are determined by two different forces:
linguistic and cultural.

5. It cannot be assumed that there is a correlation between
frequency and utility. This depends on what is meant by "utility"; a
structural word is being used grammatically and necessarily so. It, therefore,
has "functional utility", but it may not be used to convey the real cultural
meaning of the utterance and, therefore, lacks "concept conveying utility".
That this observation is true is substantiated by the need to discover
utility (available or topical) words by "centers of interest", "topical
subjects" or other methods used in developing "Fundamental French (1st
Level)" (French Ministry of National Education-1959), the "Basic (Spoken)
German Word List (1st Level)" (Pfeffer-1964), and the "Puerto Rican Spanish
Vocabulary Count" (Rodriguez Bou-1952).

6. There are so many factors and uncontrollable elements
in life and language that no satisfactory results can be obtained by
attempting to reduce such natural heterogeneity by statistical methods. Word
counts cannot be considered a valid representation of a people's culture
and linguistic activities. As a result, their pedagogical usefulness
is extremely dubious.

In general, Bull's arguments at least point up the kinds of questions
which need to be addressed if frequency counts are to be useful. However,
Bull's overall condemnation of word counting is too strong when one considers
the better modern (modified objective) research methods such as those
employed by Pfeffer since the early 1960's in his continuing study of
German. Dr. Pfeffer is using, for spoken German, a combination of nearly
spontaneous speech on general subject areas, topical areas of interest
to elicit available and utility words on specific subjects, and empirical
(pragmatic) examination and comparison of the results of the first two
methods to supplement his word lists. Aided by computers, he has proceeded
from his basic word list to semantic classifications with their English
equivalents, and has, thereafter, isolated semantically restricted
combinations of words in his "Idiom List" (1968).

Dr. Pfeffer's improvement on the Palmer formula of objective,
subjective, and pragmatic procedures for developing vocabularies is
encouraging. Coupled with sophisticated measures of word importance as
developed by Mackey (1967) and his associates at Laval University in Quebec,
the Pfeffer research should result in extensive and profitable pedagogical
use in the teaching of German and, by transfer, in the teaching of other
languages, in spite of Bull's earlier pessimism (1949).

Pfeffer's study of Spoken German, together with "Spoken Russian"
(Vakar-1966 and 1969) and "Fundamental French" (1959), coupled with the
Wepman and Hass children's count (1969), the Jones and Wepman adult count
(1966), the Howes adult count (1966), the Beier, Starkweather, and Miller
children's count (196?), the Berger count of conversation (1967), and
the Black and Ausherman college student speech count (1955), have given
impetus to studies of the spoken language. Comparative analyses of the
difference between conversational oral speech (Berger) and more formal
classroom presentations (Black and Ausherman) can and should be made.
Concurrently, however, we should also maintain the impetus of work in
the field of printed and written language as illustrated by Kucera and
Francis (adults-1965) and by Carroll, Davies, and Richman (children-1971).

At the same time, there is a need to explore new fields, such
as those indicated by Richards and Shapiro. Richards (1970) developed the
concept of "Word Familiarity" as a means of eliciting the less frequent
content bearing (utility or available) words required for balanced
vocabulary development, as an alternative to the "Centers of Interest"
approach used in Fundamental French. However, the subjective scaling
technique itself appears to be bounded by groups of individuals of like
social, cultural, and intellectual levels. Shapiro (1967) demonstrated
to his own satisfaction that relative word frequency is a "prothetic"
variable and that "magnitude estimation" is a suitable scaling technique
for subjective estimation along that continuum. If this be true, we may
be able to avoid having to use the large scale objective frequency and
range counts we have used in the past by proceeding via subjective scaling
based on words selected to obtain results equivalent to, or better than,
those obtainable from objective counts.

It still is not apparent whether the Richards and Shapiro techniques,
if fully developed and proven, will eliminate the deficiencies Bull found
in frequency counts, or whether the modifications introduced by Pfeffer
in his study of German (oral-topical-empirical approach) will do so, but
certainly we should continue to explore them all in an effort to improve
our ability to develop better vocabularies for efficient and economical
language instruction.

Summary 

In summary, it may be said that word frequency counting has evolved
complexly in the past 2000 years. With increased knowledge in the
physiological, psychological, educational, and linguistic fields, and with the
aid of tape recorders and computers, we can now do much that we formerly
could not. However, much remains to be done in understanding the
interrelationship of culture and linguistics; of la langue and la parole; of
the relationships between active and passive vocabularies; and between
oral and written language, as well as how best to present them to the
student to facilitate his learning. Much also needs to be done in perfecting
techniques of language analysis, in order to ensure uniformity of method so
that information gained may be better transferred to that common fund
of linguistic and cultural knowledge from which future advances may come.

Analyses of the Statistical Lawfulness of Vocabulary Distributions

SECTION III

Language like other natural systems has been an object of study since
man has engaged in such enterprises, or at least for as long as we have
preserved record of such study. As a natural phenomenon, it presents a
uniquely different challenge to the naturalist, however, which is not
shared by those systems which can be construed as purely physical. As
with other aspects of human behaviour, it preeminently involves
intentional motivations which underlie and give purpose to the objective
manifestations which are open to study. Thus, the early studies of language
as system concentrated almost exclusively upon its intentional aspects:
the meanings and symbol processes in whose service it was employed. Two
developments, however, presaged a different but parallel method of
investigation: the invention of movable type and the rise of enumeration
as a measurement tool.

It is the original invention of writing which in very large measure
has defined the word entities for which the modern scientist has sought
laws. The definition of this entity has remained moot since scribes have
sought to record the continuous stream of sound which is language. But,
with the invention of movable type and the consequent wide distribution
of printed language, the definitions employed by the makers of books, if
nonetheless arbitrary, became at least conventional and consistent. For
without such consistency their products could not have been successful.
Thus, in a sense the problem which this paper seeks to address is a
man-made problem. We invented a unit called the word for largely commercial
purposes and then decided that we should study our own invention by
application of another of our inventions, namely counting. Once set in
motion, however, the process appears to have assumed a life of its own: in
all regards words appear to have a natural life which shares the
characteristics of those systems we did not create, and their counting has
become a scholarly discipline of its own commercial and intrinsic value.

Although measurement by enumeration itself stretches far back into
man's time, its early uses were more linguistic and qualitative than
quantitative. Measurements of sacks of grain, wealth or livestock required
only that the measurement scale enumerate the finite and directly countable.
Such scales have the characteristic that they isomorphically map the objects
of enumeration explicitly to an only nominally representative set of numbers.
The nominal use of numbers as a measurement device is exemplified by such
modern devices as numbering the members of a football team or labeling our
coinage with denominations as qualitative categories which only partially
reflect their extrinsic values. In such measurement, one moves from few to
some through many to too many counts. One speaks of a lot of money or
more money than can be counted. The enumeration remains limited by the
mechanics of physically mapping the objects into their numerical
representations. A clay tablet which is to be used to record the number of
animals involved in a business transaction serves only because its size and
the number of potential mappings are well suited. Notions of an infinite
number of animals or of negative amounts of wealth were as meaningless as
they were impractical. It is meaningless to declare that I have two and a
half cents in my pocket or that if you have five cents in yours that you are
twice as wealthy as I. And although you may declare that I owe you three
cents in exchange for an article set at that value when I protest that I
have zero wealth, i.e., that I have -3 elemental units of money, having
minus a billion maximal units of enumeration would be treated in precisely
the same way, that is, as without meaning. Such vagaries are inconsistent
with the precision which is required of enumeration as a measurement tool.
The post-Renaissance development and acceptance, however, of arithmetic
manipulations which bore no extrinsic relationship to the practical
usefulness of enumeration suddenly opened a fertile field of speculative and
theoretical implications of the natural lawfulness of the countable. It
was not, surprisingly enough given the modern acceptance of such operations,
until the 16th century that such arithmetic operations were accepted as
other than an absurdity. And only still more recently, with the
introduction of the Calculus, has the succession rule defining infinity
been accepted.

A Taxonomy of Scaling Operations

The process of assigning numbers to phenomena within the structures
of a well formulated and explicit set of rules is known as measurement.
These measurements, in turn, purport to be the quantification of a defined
set of attributes. The measurements represent a model of the attributes
which may or may not fit the facts; i.e., may or may not accurately or
entirely depict the behavior of the phenomenon in question. One may adjudge
the adequacy of the model as representation of the phenomenon it describes
either separately or simultaneously by reference to the accuracy of the
deductions derived from the model with respect to the phenomenon's behavior
or with respect to the validity of the measurement rules used to derive
that model. Thus, much of this paper will be concerned with an evaluation
of the adequacy of the fit of variously proposed models of language
enumeration and the assumptions of measurement implicit to these models.
In order, however, to understand the deeper issues involved in the tests of
adequacy of these models, it is necessary first to discuss the broadest
implications of measurement per se.



There are four fundamental types of measurement. These forms of
measurement differ in the nature and number of assumptions which are held
to be characteristic of the qualities they seek to describe.

Nominal scales. The first of these forms of measurement, already
alluded to in the opening discussion, assumes only that it is possible to
identify the equality or non-equality of any two attributes. Let us take
as instance the quality of "wordness." Nominal scaling of this attribute
requires only that we correctly assign our units of measurement such that
instances of different such qualities receive different measurements and
that identities of the quality receive unique measurements. That is to
say, we are only required to make explicit those operations which allow
us to identify the sameness of the phenomenon to be measured: to recognize
the re-occurrence of the same quality attribute when it re-appears and to
distinguish such re-appearance from instances which are not the same. Thus,
for example, the text of this paper could be re-expressed as a nominal
scale which assigned numbers to each of the groups of inkmarks bounded by
absence of inkmarks on the basis of their unique patterns. Hence the list:

    Language     1
    like         2
    systems      5
    engaged     15
    language    92

satisfies the assumptions of a nominal scale in that the numbers do nothing
more than uniquely reassign names to the inkmark patterns on the basis of
their qualitative characteristics. Under the measurement assumptions of
this example, it is inkmark pattern whose equality or non-equality is at
issue. If the attribute of "wordness" with respect to some other quality
of inkmarks is at question, then we should be required to provide an explicit
statement of the recognition of the equality of that aspect of inkiness.
Observe that the essential character of the scaling operation remains
unchanged if we reassign our measurement numbers by any arbitrary schema
so long as we preserve the assumed characteristic of pattern uniqueness.
It is, however, a gross violation of that assumption to attribute additional
meaning to such a scale. We are not, for example, permitted to assume that
the above list of numbers implies that some inkmarks are "larger" or "bigger"
than others, and certainly not that some inkmark is X times "larger" or
"bigger" than some other identified inkmark. Although the inkmark pattern
engaged has been assigned a value which is three times larger in magnitude
than that for systems, nothing other than uniqueness of pattern is implied
by those assignments. Under the operational assumptions of this measurement
operation, we are not permitted to question the values which may have been
chosen. However, we may, indeed, question the validity of the assumption
that either the quality of pattern uniqueness was correctly identified or
that even if correctly identified it has anything meaningful to say about
the nature and uses of inkmarks.
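
The nominal assignment just described can be sketched in a few lines of
Python. The sketch rests on an assumption not stated in the text: each
distinct pattern is labeled with the position of its first occurrence in
the opening sentences of this section, a rule which happens to reproduce
the listed values. The names nominal_labels and relabel are illustrative
only.

```python
import string

# Opening sentences of this section, up to the first lowercase "language".
TEXT = (
    "Language like other natural systems has been an object of study since "
    "man has engaged in such enterprises, or at least for as long as we have "
    "preserved record of such study. As a natural phenomenon, it presents a "
    "uniquely different challenge to the naturalist, however, which is not "
    "shared by those systems which can be construed as purely physical. As "
    "with other aspects of human behaviour, it preeminently involves "
    "intentional motivations which underlie and give purpose to the "
    "objective manifestations which are open to study. Thus, the early "
    "studies of language"
)


def nominal_labels(text):
    """Label each distinct inkmark pattern with its first position (assumed rule)."""
    labels = {}
    for position, raw in enumerate(text.split(), start=1):
        pattern = raw.strip(string.punctuation)  # patterns are case-sensitive
        if pattern not in labels:
            labels[pattern] = position
    return labels


def relabel(labels, offset=1000):
    """An arbitrary one-to-one reassignment; the result is still a nominal scale."""
    return {pattern: offset - value for pattern, value in labels.items()}


labels = nominal_labels(TEXT)
for pattern in ("Language", "like", "systems", "engaged", "language"):
    print(pattern, labels[pattern])

# Uniqueness of pattern is all that must survive the reassignment.
new_labels = relabel(labels)
print(len(set(new_labels.values())) == len(new_labels))
```

Because only equality and non-equality carry meaning, the relabeled scale
is exactly as informative as the first; the sizes of the numbers carry
nothing in either version.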

Ordinal scales. If we wish to have our measurements reflect the
additional attribute of magnitude in its simplest form, i.e., attributions
of greater or less than, we are required to make explicit the measurement
operations which are to be employed in identifying that attribute. Thus,
we might, for example, define the quality of length of inkmark pattern as
the measurement operation of comparing each pattern with every other
pattern to arrive at judgements of which patterns were longer, shorter or
equal to which other patterns. Such a measurement scale for the same
example of inkmarks might take the following form:

    Language    1
    like        8
    systems     2
    engaged     2
    language    1

Observe that this new scale not only identifies inkmarks which are same or
unique with respect to pattern as defined by length but additionally
quantifies the attribute of "length." It still, however, explicitly does not
capture the quality of magnitude of length as it is normally conceived.
Hence, the example tendentiously shows the pattern like as having a scale
value 7 units greater than the pattern with lowest value despite the fact
that there are only five patterns. Note as well that we seem to have
serendipitously captured a quality attribute we had not set out to measure.
Unlike the nominal scale, this scale has assigned the two instances of
the "word" language the same value. Thus, we can say that the measurement
operation defining pattern length apparently more closely matches
some of our uses of "words" than does that of ink patterns. It, nonetheless,
of course, fails to differentiate items which are clearly differentially
used and as such still does not capture any semantic quality.
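
As a minimal sketch of the ordinal case, the following Python fragment
ranks the five patterns by length and checks that the values listed in the
text agree with those ranks on every comparison of greater, less or equal.
It assumes that "length" is the number of letters and that rank 1 goes to
the longest pattern; the function names are illustrative, not part of the
original procedure.

```python
def ordinal_ranks(patterns):
    """Rank patterns by length: 1 = longest; equal lengths share a rank."""
    lengths = sorted({len(p) for p in patterns}, reverse=True)
    rank_of_length = {length: rank for rank, length in enumerate(lengths, start=1)}
    return [rank_of_length[len(p)] for p in patterns]


def same_order(a, b):
    """True if two scales agree on every greater/less/equal comparison."""
    def sign(x, y):
        return (x > y) - (x < y)
    return all(sign(a[i], a[j]) == sign(b[i], b[j])
               for i in range(len(a)) for j in range(len(a)))


patterns = ["Language", "like", "systems", "engaged", "language"]
ranks = ordinal_ranks(patterns)
listed = [1, 8, 2, 2, 1]  # the scale values given in the text

print(ranks)
print(same_order(ranks, listed))
```

Any assignment that passes same_order against the ranks is an equally
valid ordinal scale, which is why the value 8 for like is admissible even
though only five patterns are being compared.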

Interval scales. If we were to require that our measurement scale
express the additional attribute of magnitude of difference between
patterns, we should have to define an operation by which we assessed that
attribute in addition to the definitions already adopted for the other
attribute qualities. Thus, we might define pattern "length" as number
of discrete inkmarks within a pattern. Such a measurement operation
defines "length" strictly in terms of number of elements. Thus, the patterns
of the example might be scaled as follows:

    Language    8
    like        4
    systems     7
    engaged     7
    language    8

Note that now we are permitted to make comparisons of both the attribute
qualities of "more than" and by how much. Thus, the pattern language
occupies a magnitude position with respect to like which is identical to
that of Language. It is still not possible, however, under these
measurement assumptions to identify the equality of ratios of such magnitude.
Thus, we cannot assert that the difference between language and like stands
in the same ratio as language does to like. In order to make such an
ascertainment we would have been required to define the notion of zero
magnitude of the quality being assessed. The definition which was provided
makes zero length of letters an absurdity or at least unmeasurable. It would
be impossible to identify a non-occurrence of a discrete inkmark bounded by
non-occurrences of discrete inkmarks. So long as the scale origin is either
undefined or arbitrary with respect to the quality involved in the
measurement, we are permitted to transform our measurements to any new set
of values which can be expressed as a linear equation of each other. Hence,
we are permitted to transform the example values by multiplication by 2 and
addition of ten to arrive at the new values X' resulting from the equation
X' = 2X + 10:

                X     X'
    Language    8     26
    like        4     18
    systems     7     24
    engaged     7     24
    language    8     26

These new values of X exhibit precisely the same measurement attributes as
the old. The magnitude of the intervals separating each measurement entity
has not changed their relative positions with respect to each other.

Ratio scales. The final and most restrictive form of scaling seeks to
identify the attribute of equal ratios of quality attributes. In order to
do so, such a scale must define the attribute of absolute absence of the
quality, equality of attributes, magnitude of difference between attributes
and equal ratios of those attributes. Few or none of the scaling
techniques typically employed in the social sciences can boast ratio
character. Physics, on the other hand, typically employs such measurement.
What is deceiving is the fact that the measurement of numerosity is
invariably ratio in form. As a consequence, it is a common error to assume
that any set of numbers which can be construed as reflecting the number of
some quality is ratio in character. But unless the specific measurement
operations which define the necessary assumptions of such a scale are made
explicit, the conclusion will certainly lead to misuse of the scale. Thus,
for example, if the I.Q. scale is interpreted as a scale of numerosity
of intelligence points, we are led to the gross error of ratio assertions
regarding the differences between persons with differing I.Q., not to speak
of absolute magnitude assumptions about those differences. Similarly, and
more pertinently, measurement of the frequency of word units in a text leaves
undefined the attribute of absolute zero occurrence of a unit. Zero
frequency is an arbitrarily assigned measurement which ambiguously implies
either non-occurrence in the sample or non-occurrence in the population
which represents the total language. As a measurement of non-occurrence
in the population its real value as a measurement indifferently extends
from minus infinity to zero.
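
The difference between interval and ratio character can be checked
numerically. The following Python sketch assumes that length is the number
of letters in a pattern and applies the linear transformation X' = 2X + 10
used in the interval example; the helper name linear is illustrative only.

```python
from fractions import Fraction


def linear(x, scale=2, shift=10):
    """A permissible interval-scale transformation: X' = 2X + 10."""
    return scale * x + shift


lengths = {"Language": 8, "like": 4, "systems": 7}
new = {pattern: linear(x) for pattern, x in lengths.items()}

# Ratios of differences survive the transformation.
before = Fraction(lengths["Language"] - lengths["like"],
                  lengths["systems"] - lengths["like"])
after = Fraction(new["Language"] - new["like"],
                 new["systems"] - new["like"])

# Ratios of the values themselves do not, because the origin has moved.
ratio_before = Fraction(lengths["Language"], lengths["like"])
ratio_after = Fraction(new["Language"], new["like"])

print(new)
print(before == after)
print(ratio_before == ratio_after)
```

Only a scale with a defined zero point would keep the second comparison
intact, and then only under multiplication alone, without the added
constant.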

The Word as An Attribute of Measurement 

Any discussion of measurement as specifically applied to vocabulary
must grapple with the definition of the attribute of "wordness." Although
it is clear that the user of a language finds the notion of word
psychologically meaningful, attempts to make the notion linguistically
explicit have not been successful. Greenberg (1957, p. 27) has summarized
the linguists' position on this matter as follows: "Some linguists deny any
validity to the word as a unit, relegating it to folk linguistics. Others
believe that the word must be defined separately for each language and
that there are probably some languages to which the concept is
inapplicable." Nonetheless, Sapir (see Ulman, 1962, p. 39) has observed that
"The naive Indian, quite unaccustomed to the concept of the written word,
has nevertheless no serious difficulty in dictating a text to a linguistic
student word by word; he tends, of course, to run his words together as
in actual speech, but if he is called to a halt and is made to understand
what is desired, he can readily isolate the words as such, repeating them
as units." For those languages in which there is a rich cultural tradition
of writing and literacy, the word as apprehended by its speakers might be
construed as little more than the propagation of the conventions of writing.
It is in this sense that most counts which define the word as that which
is conventionally bounded by spaces in printing define their measurement
units. Even in this narrowest of senses, the study of such conventions
might be of interest. But it is the search for the word more broadly
considered which is of particular interest: its psychological and linguistic
significance.

The word as psychological unit. The child's original exposure to
language is solely vocal in form; we reserve instruction in writing and
reading until rather late in the child's development, or at least until the
spoken language is reasonably in hand. But even the child's earliest vocal
experience involves considerable emphasis on those isolatable units of speech
which have unitary symbolic value. The child seeks and is given names of
things and these names are typically those conventionalized units which we
normally call words. In those high cultures for which literacy is sanctified,
the parents of children anticipate and transmit the written language
conventions. Thus, it is the rare parent of a high culture child who would
respond to a query regarding a drum with the response
"Thatsthethingthatgoesboom." It is much more likely that the parent will
pare the response down to the minimally isolatable unit of semantic intent
which comes closest to the conventional lexical entry for that object, i.e.,
"drum" accompanied by an appropriate pointing gesture, rather than
"Thatscalledadrum" or even "Thatsadrum." Further, once the child is made
literate, what may have begun as a printer's convention is perceived as a
psychological necessity which takes on its own significance. Later on,
should this now literate child be required to learn a second language, he
will find it both efficacious and satisfying to learn a vocabulary of words
for that language and even to expand his own tongue by study of its lexicon.
Finally, the adult speaker of a language with written traditions will
unerringly identify upon request what is or isn't a word. And even,
according to Greenberg, those adults without writing can do the same.

The word as linguistic unit. Assuming then that a reasonable case can
be made for the word as a psychological reality, there remains the question
as to whether or not there exists a linguistic definition which can serve
as an explication of the concept. That is to say, can we provide an
explicit theory of "wordness" which is independent of the user's perceptions
of the conventions he employs? Such an explication implies what Chomsky
(1957) has called "explanatory adequacy" in contrast to the descriptive
adequacy which might be served by reference to the conventions of a
particular language, whether written or spoken. Accordingly, it is clear that
when considered in this light, the answer to such a question lies at the
heart of the complete theory of any language and as such will be
extraordinarily difficult to attain. The most modern of grammatical
treatments which would seek an account of the structure of language typically
eschew the problem as premature, choosing instead to assume the weaker
requirement embodied in the presumption of a commonsensical appreciation of
what a word is as commonly understood (i.e., psychologically apprehended) by
the users of the language.

It is thus no accident that one must search backward into the
Bloomfieldian era to find attempts at a linguistic definition of the word,
an era for which descriptive adequacy was the prime consideration. Bloomfield
(1933) attempted to define the word by reference to the formal characteristics
of syntactic boundedness. Those minimal forms which can occur as sentences
he termed free forms and those which are never used as sentences bound forms.
Words, as they are commonly used, are those minimal utterance units which
can occur as free forms. What distinguishes words from other free forms is
that words cannot be split into still smaller forms without leaving a bound
form residue.

It is not difficult to find instances of what the speaker of this
language would psychologically call words which would not be called words by
Bloomfield's definition. All compound forms composed of two or more
independent words (by either the conventional definition or the definition
under test) such as penknife or yardstick provide paradoxical exceptions to
the definition. Similarly, the functors such as a or the must occur as bound
forms under the definition and yet they are clearly apprehended as
psychologically defined words. The meta-language arguments cannot serve to
rescue the definition, for all such arguments necessarily involve the
definition we seek as a presumption. Thus, to say that "The." is a
permissible sentential response to the question "What is the third word of
this sentence?" would only make the issue more cloudy than it already is.



The word as lexical entry. Lexicography at its best represents the
structural and functional characteristics of a language as it is
conventionally employed, at least by those who are largely responsible for
shaping the culture defined by that language. At its worst, it represents
a set of normative prescriptions regarding its language hardly even
characterizing its use by those pedants who would prefer proscription to
description. The conventionality of either the description or prescription
of its source books is largely dictated by the vicissitudes of publishing
and data collection. But such conventionality serves, nonetheless, to
represent the conventions of the language usage and, as a normative model
of such usage, to itself perpetuate those conventions. The conventions,
in turn, capture the aggregate distillation of the psychological realities
by which the language user accounts for his language. New words and new
usages replace old conventions at the leisurely pace of slow moving publishers
who thus assure that the changes have already been accepted as conventions by
the majority of their users. All of these factors in combination serve to
make the lexicographer's source book an unequaled arbiter of the problems
of defining wordness.

The word as grammatical form. Conventionality in language usage extends
beyond the boundaries of wordness and arbitrary meaning to function and
structure. Grammatical classes, or parts of speech as they are more
traditionally called, codify by label the functional elements which the
language user deems essential to his account of the structures he employs.
Whether or not such labels have real explanatory meaning in the theory of
language is moot. But, again, as conventions they do have at least
psychological meaning which, even if without linguistic validity, at least
deserves recognition by dint of the universality of their acceptance in
instruction and perception. And, as before, such purposes are best served by
the conventionality of the language's dictionary or alternatively, as in this
research, by a structural definition derived from the mutual substitutability
of speech parts in language frames which model their usage.

A special set of problems. There exists a grey area of wordness for
which no solutions are readily available. Compound forms that have not as
yet made the complete and preferred transition from multiple words through
hyphenated forms to single units, or fixed collocations too extensive in
length to move into the hyphenated life form but which nonetheless function
as if they were single units, and learned forms which because of the pedantry
of their users cannot be tolerated to change, all represent exceptional cases
for which it is difficult to devise other than ad hoc and arbitrary solutions.

Inflectional forms in those languages for which such grammatical
mechanisms are productive do not, however, represent a particularly difficult
problem. It is possible to identify variant forms of the simpler root form
on the basis of their derivation from a paradigm. Such a paradigm has the
characteristics of regularity and of limiting the number of variants to an
absolutely small number. Adverbs in English, for example, are very largely
paradigmatically derived from their more productive adjectival roots by the
single pattern form of -ly.

A functor may be defined as any free-standing word form in analytic
languages which is lexically defined as serving strictly grammatical rather
than referential functions, and for inflectional languages as that
morphological change of the stem which carries such meaning. This definition
facilitates the counting of both lexical forms and grammatical patterns.
In the first instance, the working definition of functor is used to
suppress those elements which, occurring with such overwhelmingly high
frequency, tend to usurp the lower-frequency, but higher information-content
forms. In this sense, "functor" is a convenient catch-all for those
terms in a language which are finite in number, but which account for a
greatly disproportionate frequency of occurrence. Display 1 illustrates
the frequency count equivalent to their total occurrence in the elicited
samples. Words which can be generated paradigmatically from a base form
can be collapsed into the base form which will then receive a frequency
count equivalent to the total occurrence of the paradigm membership; thus,
all variations of verbs due to inflections for person, number, and tense
can be counted as instances of the base form.
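
The two counting conventions just described, suppression of functors and
collapse of paradigm members into a base form, can be sketched as follows.
The functor list, the table of base forms and the sample sentence are
hypothetical stand-ins chosen for illustration; an actual count would draw
them from a dictionary or from speaker judgements, and the function name
count_base_forms is likewise an assumption of this sketch.

```python
from collections import Counter

# Hypothetical working lists, for illustration only.
FUNCTORS = {"the", "a", "of", "and", "to", "is"}
BASE_FORM = {
    "walks": "walk", "walked": "walk", "walking": "walk",
    "sings": "sing", "sang": "sing",
}


def count_base_forms(tokens, functors=FUNCTORS, base_form=BASE_FORM):
    """Suppress functors, then credit each inflected variant to its base form."""
    counts = Counter()
    for token in tokens:
        word = token.lower()
        if word in functors:
            continue  # functors are excluded from the lexical count
        counts[base_form.get(word, word)] += 1
    return counts


sample = "the dog walks and the dog walked to the bird and the bird sang".split()
result = count_base_forms(sample)
print(result)
```

Under these assumptions walks and walked are both credited to walk, so the
base form receives the total frequency of its paradigm, while the functors
contribute nothing to the lexical count.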


A final word about word. In the end, the final definition of wordness
rests entirely upon the conventions of usage in two senses of use. First,
we may interpret and operationalize the psychological apperceptions of the
language user for an answer to the meaning of word. We require only that
the user recognize and distinguish those units which he would construe as
words. We do not require that the user explicitly define or understand
the processes by which such recognition is achieved. Where dictionaries
exist, these source books provide the best aggregate judgements of such
recognition; where they do not, we shall have to compile such judgements
directly from the speakers themselves. In the second sense of use, it is
the purposes of our definition of wordness which must be examined. In
this paper we shall be focusing on the statistical lawfulness of word
occurrences. The test of alternative definitions of the word as unit of
measurement rests entirely upon the empirical comparisons of the outcomes
of these definitions. Does it, as a matter of empirical fact, make a
difference in the characteristics of the functional lawfulness of vocabulary
to define root variants as separate or same forms? When the uses of our
definition of word are pedagogical rather than theoretical, it is surely
certain that we shall at least require other tests of that definition, tests
which will involve considerations which are as practical as the model tests
are theoretical.



Statistics and Measurement: The Schemapiric View

Before beginning our survey of the statistical models which have been
proposed for the distribution of vocabulary in language, it is appropriate
to forewarn the reader of the theoretical distinction between counting and
modeling, between empirics and schematics. Since that distinction has been
for some time the special concern of S.S. Stevens, it is appropriate to
quote him on the schemapiric principle at some length. "Although
measurement began in the empirical mode, with the accent on the counting of
moons and paces and warriors, it was destined in modern times to find itself
debated in the formal, schematic, syntactical mode, where models can be made
to bristle with symbols. Mathematics, which like logic constitutes a formal
endeavor, was not always regarded as an arbitrary construction devoid of
substantive content, an adventure of postulate and theorem. In early ages
mathematics and empirical measurement were as warp and woof, interpenetrating
each other so closely that our ancestors thought it proper to prove
arithmetic theorems by resort to counting or to some other act of measurement.
The divorce took place only in recent times. And mathematics now enjoys
full freedom to 'play upon symbols,' as Gauss phrased it, with no constraints
imposed by the demands of empirical measurement.

"So also with other formal or schematic systems. The propositions of a
formal logic express tautologies that say nothing about the world of tangible
stuff. They are analytic statements, so-called, and they stand apart from
the synthetic statements that express facts and relations among empirical
objects. There is a useful distinction to be made between the analytic,
formal, syntactical propositions of logic and the synthetic, empirical
statements of substantive discourse.

"Probability exhibits the same double aspect, the same schemapiric
nature. Mathematical theories of probability inhabit the formal realm as
analytic, tautologous, schematic systems, and they say nothing at all about
dice, roulette, or lotteries. On the empirical level, however, we count
and tabulate events at the gaming table or in the laboratory and note their
relative frequencies. Sometimes the relative frequencies stand in isomorphic
relation to some property of a mathematical model of probability; at other
times the observed frequencies exhibit scant accord with 'expectations.'"
(S.S. Stevens, 1968.)

it is obvious that Stevens might as readily and appropriately have cited 
the counting of words in the above passage. Adopting this schemapiric point 
of view, we shall for each of the models of vocabulary distribution to be 
reviewed, separately examine the schematic assumptions of the models, their 
fit to the empirical data and the psychological justification of those
assumptions. But before proceeding there is still another consideration which
must be addressed by any statistical model designed to account for an
empirical domain, namely, the methodological problems of sampling.

Methodological issues in sampling. If one wishes to construe a selected
corpus of language to be representative of some larger body of language of
which that corpus is a sample, the researcher is compelled to provide a
rational defence of the sample's representativeness. Selection by random
strategy is designed to provide such justification on the grounds that a
random sample requires that all members of the population have an equal
probability of being selected as members of that sample. Under such a
rationale, the occurrence frequencies of the units of analysis are both
efficient and unbiased estimators of the population probabilities of those
units. But then two problems arise: random with respect to what, and how
are we to translate "random" into a set of explicit procedures? The
overwhelming bulk of research on vocabulary has concentrated on the written
forms of language; the worthwhile analyses of spoken language number fewer
than half a dozen.
The preceding sections have reviewed and evaluated these studies. The
populations represented by the spoken and written forms of a language are
both different and the same when viewed from differing standpoints. We have
argued that at the level of the functor, the vocabularies of speech and
writing are as alike as the linguistic code is inflexible with respect to
their grammatical function. At the level of substantive choices, the two
are as separate as the distinction made by the culture between informal and
formal styles of communication, with an extensive penumbral area of overlap
between those styles at the level of the higher frequency substantives. And
from still another viewpoint, the two communication forms may or may not
be different with respect to their schemapiric lawfulness, a consideration
which we are now deferring.

But in a sense even the distinction being made between speech and
writing is itself artificial for some purposes. Plays are written to be
spoken, and all writing must be speakable if it is to conform to its parent
linguistic code. Nor is it simple to classify the procedure which this
research proposes as the optimal sampling strategy: that of eliciting
restricted associations from the users of a language. That procedure is
designed to bypass the written-spoken dichotomy by sampling from the
highest frequency items of the user's vocabulary. The rationale of that
assumption, in turn, rests upon the spew hypothesis. Under that rationale
the problem of corpus length is also largely avoided, for no attempt is
being made to fully sample the entire frequency range of vocabulary items
as they appear in the population. The spew hypothesis quite simply posits
that "...the order of emission of verbal units is directly related to
frequency of experience with those units." (Underwood and Schulz, 1960.)



A number of studies have provided strong support for such an assertion.
Johnson (1956) demonstrated that [percentage illegible] of the most frequent
associations to the Kent-Rosanoff stimuli occurred with a frequency of
[figure illegible] times or more per million in the Thorndike-Lorge list,
whereas only 48% of the least frequent responses had equally high ratings.
Howes (1957) computed the correlation between frequency of associations to
the Kent-Rosanoff list and frequency of words in the language to be .94 if
functors are excluded from consideration. The effect has even been
demonstrated when subjects are asked to provide male given names; those
names which occur most frequently in the written language are also those
most likely to be given by a subject (Cromwell, 1956). Bousfield and Barclay
(1950) have also demonstrated that the order of emission of verbal units is
directly correlated with their frequency of occurrence in the language.



Taken in its weakest sense, the spew hypothesis is not an hypothesis
at all. It is obvious that if emission of verbal units is taken to include
all uses of the language, the complete tabulations of such emissions are
the frequencies of those units. But in its strongest sense, the spew
hypothesis provides a sampling strategy for estimating the total linguistic
probability of verbal units. Construed as ad libitum responses,
associational responses obtained from subjects provide a procedure of higher
face validity for estimating the frequency of spoken language units.

Either spoken or written data suffer from several inherent difficulties
which accrue to the nature of natural language codes. The lawful statistical
nature of such counts always produces a frequency ordering in which roughly
half of the occurrence types have token realizations which are at the limits
of measurement; i.e., have single occurrence frequencies. Probability
estimates of population frequencies from such inherently errorful sample
frequencies are statistically unreliable. At the high frequency end of the
distribution of such word samples one consistently finds that function and
interstitial words account for disproportionately high percentages of the
total sample. The situation is roughly analogous to using the Wall Street
Journal to determine the frequency of English units. From such a data base,
ordinal numbers and fractions would dominate the frequency distribution of
[text missing in source] this disparity. In English, nouns, adjectives,
verbs, and "pure" adverbs comprise over 99 percent of the total available
vocabulary presented in the Shorter Oxford English Dictionary; in contrast
to this, we have for all the remaining parts of speech not more than 650
words. Yet these two groups provide approximately equal proportions of the
total word usage. While all Group II words in English are not strictly
"functors", they all share three features of functors: (1) they belong to a
small, limited, isolatable class; (2) they have paradigmatic features;
(3) they occur with extremely high frequency and, thus, suppress
non-functorlike Group I words. It is, therefore, our contention that
functorlike words should be treated separately, both for lexical counts and,
as it turns out, for grammatical pattern counts.
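The two difficulties just described can be made concrete with a small sketch. It assumes a hypothetical, deliberately tiny functor list and an invented sample sentence; neither comes from the counts discussed here, and a real study would use a full closed-class inventory.

```python
from collections import Counter

# Hypothetical, illustrative functor list (an assumption, not a real inventory).
FUNCTORS = {"the", "a", "of", "and", "to", "in", "on", "is", "it"}

def hapax_and_functor_shares(tokens, functors=FUNCTORS):
    """Return (share of types occurring once, share of tokens that are functors)."""
    counts = Counter(tokens)
    n_tokens = sum(counts.values())
    n_types = len(counts)
    hapax_types = sum(1 for c in counts.values() if c == 1)
    functor_tokens = sum(c for w, c in counts.items() if w in functors)
    return hapax_types / n_types, functor_tokens / n_tokens

# Invented sample text, used only to exercise the function.
sample = "the cat sat on the mat and the dog sat in the sun".split()
hapax_share, functor_share = hapax_and_functor_shares(sample)
print(f"single-occurrence types: {hapax_share:.2f} of all types")
print(f"functor tokens: {functor_share:.2f} of all tokens")
```

Even in so short a sample most types occur only once, while a handful of functors supply more than half of the tokens.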

In the case of strictly inflectional languages, the paradigmatic functors
will occur as bound forms in traditional orthography. This presents no
problem other than identifying these forms and coding them in such a manner
that the "root" form will be the entry into the frequency count. In Latin,
for example, agricolae would be subsumed into agricola. If the text
contained one instance of agricolae and one of agricolam, the frequency
tally would show agricola as occurring twice.

        Approximate Occurrences of Parts of Speech in
             Shorter Oxford English Dictionary

    Group I
        Nouns            58,000
        Adjectives       27,000
        Verbs            13,500
        Adverbs (*)         150
                         ------
                         98,650  (approx. total)

    Group II
        Pronouns            100
        Prepositions        100
        Conjunctions         50
        Aux. Verbs           10
        Articles              2
                         ------
                            262  (approx. total)

    (*) Counts only "pure" adverbs not derived paradigmatically from
        adjectives.

Display I. Estimates of vocabulary words in different parts of speech
available in the English language (Yule, 1944).

However, pure analytic and pure inflectional languages are the exception,
not the rule. Therefore, the treatment of "functors" in the hybrid
languages must allow for the uncluttered tally of words, yet preserve the
grammatical patterning of occurrences. Thus, in German, for example,
zum kleinen Kind would be coded as preposition-definite
article-adjective-noun for grammatical pattern and klein would be tabulated
in its base form for frequency tally. Similarly, in English, cat and cats
would appear as two occurrences of cat, since in English the two forms can
be considered as co-occurring items of a paradigm. Verbs would be treated
similarly for frequency counts. The total tally for the verb, run, for
example, would include occurrences of paradigmatic forms such as runs, ran,
and running.
There are other common words which should be given separate treatment.
For example, numbers, certain kinship terms, days of the week, months of
the year, and the like require special attention. The term Monday should
be taken to include the terms for the other days of the week as though it
were a root form from which the others are derived. Thus, all names for
the days of the week which are elicited would contribute to the frequency
total for the base form, arbitrarily taken to be Monday. Similarly, in
English, the terms for the members of the nuclear family (father, mother,
son, daughter, brother, sister, husband, wife) should share a position in
this count. Functors, as is the case with numbers, are important to the
language, but they displace and minimize the importance of the other
substantive form classes because of their overwhelming prominence in natural
languages. Foreign language instruction has typically met this difficulty
by subdividing the lexical units of the language into separate form classes.
Such form classes are fundamental to any description of a language. They
function at elemental levels in both phrase structure and transformational
rules. The speaker of a language only rarely can make explicit the category
rules which define such grammatical classes and, even in these rare cases,
such explicitness is typically incorrect. However, the speaker does use
such rules in the construction of any utterance; his inability to provide
an explicit account of the nature of those rules is not evidence against
their functional utility. If the speaker is given a contextual frame which
calls for a unit from a particular grammatical class, the speaker can
provide an appropriate completion. Further, the choice of the particular
completion within that functional class is apparently determined by the
frequency of experience of that unit. Thus, elicitation procedures which
call for grammatical class associations in specified frames simultaneously
solve two problems otherwise encountered in frequency counts: (1) all token
frequencies are automatically marked by function class and (2) frequency
determinations of unit types are separately determined within function
class, thus increasing the pay-off yield of the data collection.
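The tallying conventions described above (paradigmatic forms folded into a base form, the days of the week pooled under Monday, and frequencies kept separately within each form class) might be sketched as follows. The base-form table, the class labels and the responses are all invented for illustration; they are assumptions, not the coding scheme actually used in this research.

```python
from collections import Counter, defaultdict

# Hypothetical base-form table: every entry is an illustrative assumption.
BASE_FORM = {
    "cats": "cat",
    "runs": "run", "ran": "run", "running": "run",
    "tuesday": "monday", "friday": "monday",
}

def tally(responses):
    """Tally (form_class, word) pairs by base form within each form class."""
    counts = defaultdict(Counter)
    for form_class, word in responses:
        key = word.lower()
        counts[form_class][BASE_FORM.get(key, key)] += 1
    return counts

# Invented elicited responses, already marked by the frame's form class.
responses = [
    ("noun", "cat"), ("noun", "cats"),
    ("verb", "run"), ("verb", "ran"), ("verb", "running"),
    ("noun", "Tuesday"), ("noun", "Monday"), ("noun", "Friday"),
]
counts = tally(responses)
print(dict(counts["noun"]), dict(counts["verb"]))
```

Because each response arrives already labelled by the frame that elicited it, no separate grammatical tagging pass is needed, which is the first of the two advantages claimed above.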

For the models of continuous language samples to be reviewed, the
issue of corpus length is as crucial as it is difficult to answer. Rapoport
(1965), addressing himself to this problem with regard to speech samples,
has argued that:

"In the selection of speech to be analyzed, the question of how long
the transcript should be, though practically important, is not easy to
answer. Intuitively it would be nice to have very long transcripts, 5000
words or more, in order to get a substantial sample of the subject's
vocabulary. Practical considerations, on the other hand, call for smaller
samples. In addition, it might not be feasible to obtain very long samples
of connected discourse from the subject. Without considering some
exceptions, people usually do not utter 5000 words and more in one session on
the same topic. It seems that a proper solution to the length of the
transcript is an empirical one. Sample sizes should be considered within the
range where the mathematical form of the observed distribution of
word-frequencies is not markedly changed."

And then, after reviewing data similarly collected by Howes and Geschwind
(1962), who claimed that: "These data show that even for samples of 1000
words, there is excellent correspondence between the theoretical equation
and the empirical distributions. The considerations suggest that for most
purposes samples of 2000 words are adequate for estimating parameters of
[spoken] word-frequency.", Rapoport concludes that: "It thus seems that
the sample sizes used [in the Rapoport study] (between 1000 and 5000 tokens)
are appropriate."

Word Frequency Comparisons Among First 400 Entries of Three Counts
[Table not reproduced. Columns: TYPE and F (frequency) for each of the
Kučera-Francis, American-Heritage and Black-Ausherman counts.]

As substantiation of this conclusion, an inspection of the following
analysis is revealing. The accompanying tables compare the occurrence
frequencies of the first 400 lexical types as obtained from the vocabulary
counts compiled by Kučera and Francis, Black and Ausherman, and Carroll
(American Heritage) for adult written, spoken and children's texts of
English, respectively. Starting with the Kučera-Francis count as the
comparative basis, each of the remaining counts has been reordered to
correspond to that count, and any word type not within the first 400 entries
of the comparison counts was deleted from the print-out. The three counts
represent the most extensive and up-to-date counts of their respective types.
The Kučera-Francis and Carroll tabulations are based on counts of more than
one million and five million running words, respectively. The
Black-Ausherman count of spoken English is based on a data base of some
288,000 total words. The three counts, thus, represent samplings of spoken,
written, adult and child language and display a broad range of both
stylistic and content differences. The first 400 types of each count,
respectively, account for 60, 64 and 79 percent of their totals. It will be
observed that with the exception of the word YEARS, which is not within the
first 400 words of the Black-Ausherman count, the first 100 entries of the
Kučera-Francis count are matched by identical occurrences in the other
counts. But more importantly, the order of occurrence of these matches is
remarkably similar. Pearson product-moment correlations of the frequencies
of these items among the three lists are all in excess of .95. In fact, even
when the correlations are taken over the entire 400 word types, the
correlations among the lists are still in excess of .65. Thus, despite the
differences which can be expected to accrue to these differing versions of
English and their respective differences in corpus size, the occurrence and
order of the highest occurring types is substantially unchanged. It is also
clear that these highest frequency items are nearly uniformly functors.
Where differences exist in the counts, those differences with only few
exceptions obtain for the substantive items. As one proceeds deeper into the
frequency lists, although in absolute terms still barely into the total
number of types (86,741 for the American-Heritage count), one increasingly
encounters greater and greater ideational influences reflected by the
differences in the data sources. For example, the Black and Ausherman count
used military personnel giving extemporaneous speeches as its data base, and
words such as WAR, GENERAL, SERVICE and ATTACK should not be expected to
occur in a count of the language of children's texts, for which the primary
colors, numbers and body parts would be expected.
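The comparison procedure described above (restrict each count to its first entries, drop any type missing from the comparison count, then correlate the frequencies of the survivors) can be sketched as below. The two small counts are invented solely to exercise the code and bear no relation to the actual Kučera-Francis, Black-Ausherman or American-Heritage figures; the function names are likewise assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def shared_top_types(count_a, count_b, top_n):
    """Types among the first top_n of count_a that are also among the
    first top_n of count_b, kept in count_a's frequency order."""
    top_a = sorted(count_a, key=count_a.get, reverse=True)[:top_n]
    top_b = set(sorted(count_b, key=count_b.get, reverse=True)[:top_n])
    return [w for w in top_a if w in top_b]

# Invented frequencies (assumptions for illustration only).
count_a = {"the": 100, "of": 60, "and": 50, "years": 8, "war": 5}
count_b = {"the": 200, "of": 120, "and": 100, "war": 30, "general": 2}

shared = shared_top_types(count_a, count_b, top_n=4)
r = pearson([count_a[w] for w in shared], [count_b[w] for w in shared])
print(shared, round(r, 3))
```

In this toy case "years" falls outside the comparison count's first entries and is dropped, much as YEARS is in the comparison reported above.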

In addition to the question of optimal sample size, there still remains
the question concerning the method of sampling. The samples must be
sufficiently scattered with respect to subject matter so as to avoid the
vocabulary biases inherent in the ideational clumping which characterizes
language. Yule (1944) has specifically rejected the random strategy of
sampling in favor of spread sampling. This technique spreads the sample as
uniformly as possible over the whole range of the work to be sampled. Yule's
suggestion was to select a sample of words from each page, the words being
sampled within the page unit taken either at random or from a continuous
passage of a prespecified number of lines. It should be observed that the
technique which we have employed for the sampling of American television in
this research is a spread sample based upon randomly selected continuous
segments of five-minute duration. The procedure is quite straightforward. A
clock activates a tape recorder for a five-minute interval during each hour
of total speech time. The specific five-minute interval is varied in a
pseudo-random fashion so that different five-minute segments are sampled at
each hour. The technique for accomplishing this sampling is instrumentally
simple. The minute and hour hands of a normal clock coincide at a different
locus during each hour of a day. The specific time of coincidence in hour h
is given by the equation:

(0)  h : 5h + \frac{5h}{11}    (hours : minutes)

assuming the clock is started with the hands at 12 midnight. Thus, for
example, the first coincidence of the hands would occur at 1:05.5, the
second at 2:10.9 and so on. As real time progresses through the day, the
five-minute sampling segments precess further into the hours. In order to
avoid this consistent precession, the clock is randomly started at a
different clock time each day.
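The coincidence times follow from the fact that the hands of a twelve-hour clock meet once every 12/11 hours, that is, every 720/11 minutes. The sketch below tabulates them; the function names are ours and purely illustrative, and the sketch does not model the recorder or the random daily start.

```python
def coincidence_elapsed_minutes(k):
    """Minutes elapsed at the k-th coincidence after the hands start
    together at 12:00 (the hands meet every 720/11 minutes)."""
    return k * 720.0 / 11.0

def as_clock(elapsed_minutes):
    """Convert elapsed minutes to (whole hours, minutes past the hour)."""
    hours, minutes = divmod(elapsed_minutes, 60.0)
    return int(hours), minutes

for k in (1, 2, 3):
    h, m = as_clock(coincidence_elapsed_minutes(k))
    print(f"coincidence {k}: {h}:{m:05.2f}")
```

The first two coincidences fall at about 1:05.45 and 2:10.91, so each successive five-minute segment begins some 5.45 minutes later in its hour than the one before, which is the precession referred to above.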

Notational conventions. For convenience and consistency, the following
notational definitions will be employed in this paper. A word token, i.e.,
any occurrence of a word unit, will be symbolized as N_w and the total
number of words as simply N. Let K be the number of different words in the
sample, i.e., the total types, and W_i symbolize a particular word type. For
any word, W_i, the symbol, n_i, shall stand for the frequency of that word
in the sample. The symbol n_i, with i as subscript indicating frequency,
will also be used to designate the number of different words having the
same frequency i. Thus, we may define the probability of a word type in a
sample as the relative frequency:

(1)  p(W_i) = \frac{n_i}{N}

or the probability of the words of given occurrence frequency as the
relative frequency

(2)  p(n_i) = \frac{n_i}{N}

Since the notation must unambiguously refer to the frequency of a word
type, (1) will alternatively be expressed as:

(1')  p(W_i) = \frac{i}{N}

The fraction of the entire sample of types having frequency i will be
designated as:

(3)  \theta_i = \frac{n_i}{K}

and the fraction of the sample made up of tokens with frequency i as:

(4)  \phi_i = \frac{i \, n_i}{N} = p(W_i) \, n_i

A frequency ordering of the K word types by i, such that larger values of i
have higher rank, may be ordinally transformed by assignment of ranks, r,
with ties in rank given the average rank position of the equal i
frequencies. Thus, the relation:

(5)  i \cdot r = C, where C is a positive constant,

expresses the regularity first noted by Zipf of the relationship between
rank and frequency of word types.

If values of n_i are plotted as a function of i one obtains a distribution
known as the type-frequency distribution. Alternatively, if one plots
i n_i as a function of i one obtains the distribution known as the
token-frequency distribution, either distribution being a word frequency
distribution.

These definitions will suffice for most of the following discussion;
where special symbols are introduced their definitions will be given at
that time.
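As a check on the notation, the quantities defined above can be computed from any list of tokens. The sketch below is a minimal illustration under our reading of the definitions: n[i] is the number of types with frequency i, theta and phi are our names for the type and token fractions, and rank 1 is assumed to go to the most frequent type, with tied types sharing the mean of the ranks they occupy.

```python
from collections import Counter

def frequency_summary(tokens):
    """Return N, K, n (types per frequency i), theta_i = n_i/K, phi_i = i*n_i/N."""
    word_freq = Counter(tokens)           # frequency i of each word type
    N = sum(word_freq.values())           # total tokens
    K = len(word_freq)                    # total types
    n = Counter(word_freq.values())       # n[i]: number of types with frequency i
    theta = {i: n_i / K for i, n_i in n.items()}
    phi = {i: i * n_i / N for i, n_i in n.items()}
    return N, K, n, theta, phi

def average_ranks(word_freq):
    """Rank types by decreasing frequency; ties receive their mean rank."""
    ordered = sorted(word_freq.items(), key=lambda kv: -kv[1])
    ranks = {}
    pos = 0
    while pos < len(ordered):
        end = pos
        while end < len(ordered) and ordered[end][1] == ordered[pos][1]:
            end += 1
        mean_rank = (pos + 1 + end) / 2.0   # mean of ranks pos+1 .. end
        for word, _ in ordered[pos:end]:
            ranks[word] = mean_rank
        pos = end
    return ranks

tokens = "a a a b b c c d".split()          # invented toy sample
N, K, n, theta, phi = frequency_summary(tokens)
ranks = average_ranks(Counter(tokens))
print(N, K, dict(n), ranks)
```

The token fractions phi_i necessarily sum to one over all observed frequencies, as do the type fractions theta_i, which is a convenient sanity check on any tabulation.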



Four Models of Word Frequency Distribution 

Four quasi-distinct models have been proposed as schematic
representations of the claimed regularities of word occurrences in natural
languages. These models are those proposed by Zipf, Mandelbrot and Yule, and
the lognormal distribution proposed by Herdan. Each has been proposed
as the best schemapiric representation of the language observations and
each has been criticized on both schematic and empirical grounds. As we
shall see, however, the lognormal model has received the most attention
and, at least at this date, appears to be the most robust of the
alternative formulations.



Zipf's "law". It is fair to say without severe risk that all of
this began with G. K. Zipf (1935, 1949).[1] It is also not very risky to
say that his contribution is probably limited to his role as originator
and not to his so-called "law." Zipf's observation can be expressed
simply. If each word type of a frequency sample is assigned a rank value
corresponding to the decreasing frequencies of these word types and
plotted as a bi-logarithmic function of those frequencies, the relationship
will be roughly linear with a slope of approximately -1. Zipf saw
this regularity as a natural consequence of the Principle of Least Effort.
According to this principle, the speaker prefers "a small vocabulary that
will spare the effort involved in selecting the exact words needed to
encode his message" whereas the listener prefers "a large vocabulary that
will spare him the effort involved in determining which of several
alternative messages the talker intended." The speaker was, in Zipf's terms,
driven by the Force of Unification and the listener by the Force of
Diversification. Zipf presumed that the equilibrium state constituting
the resolution of these two opposing forces produced the rank-frequency
relation. Whether or not such forces indeed exist in either the individual
or aggregate language user is moot. But even if they were to exist,
it is unlikely that they would supply an explanation of the particular
form of the bi-logarithmic relationship between frequency and rank claimed
by Zipf. The most telling criticism, which by now has become almost
hackneyed, is that rank and frequency are of necessity lawfully related not
by empirical observation but by definition. The interval scale of frequency
when collapsed into the ordinal scale of rank necessarily is a negative
monotone of that rank. Thus, the primary observation captured by the law
is trivial. Nonetheless, it can be argued that there are an infinite
number of negative monotones which are possible under the definitional
relationship between ordinal and interval scales. Although the
bi-logarithmic transformation proposed by Zipf defines a specific and unique
such monotone, it has been demonstrated time and again that the
bi-logarithmic transform provides a very bad fit to the data at either the
high or low frequency tails of the distribution. The regularity only holds
for the narrow middle range of frequencies.

[1] In point of fact, the observation of regularity between a word's
frequency and its rank in a sample had been made by both Estoup (1916) and
Willis (1922) before him. Zipf, nonetheless, was largely responsible for the
subsequent proliferation of theory and research on the topic.
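The bi-logarithmic relation can be examined for any count by regressing log frequency on log rank. The following is a minimal sketch using ordinary least squares and simple ordinal ranks (ties are not averaged here, a simplification); the synthetic frequencies are constructed to obey i · r = C exactly and are not empirical data.

```python
import math

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank),
    with rank 1 assigned to the largest frequency."""
    ordered = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(f) for f in ordered]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Synthetic frequencies satisfying frequency * rank = 1000 (illustration only).
ideal = [1000.0 / rank for rank in range(1, 51)]
slope = loglog_slope(ideal)
print(round(slope, 6))
```

For real counts the criticism above suggests fitting the slope separately over the high, middle and low frequency ranges; a marked departure from -1 at the extremes would illustrate the poor fit in the tails.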

Mandelbrot's distribution. Mandelbrot's (1953) publication of an
information theory approach to language statistics constituted what amounted
to a mathematically rigorous defense of Zipf's observation. Where Zipf had
been vague, imprecise and mathematically naive, Mandelbrot rigorously derived
the rank-frequency distribution from mathematical arguments based upon
precisely defined quantities. Rapoport (1965) has provided an explication
of Mandelbrot's model which, because it cannot be equaled in clarity, is here
presented.

Consider the rank-frequency distribution, where the most frequent word
has rank 1, the next most frequently occurring word has rank 2, and so on
to the least frequently occurring word in the vocabulary of K words in the
sample. Associated with each word is its probability of occurrence,
p(r), r = 1, 2, ..., K, where r is the rank of a specific word.

Associated with each word there is also a "cost" of producing it. The
question which arises is, if the assignment of probabilities to the several
words of vocabulary is under the control of the user of the language, which
assignment will minimize the average "cost" per word? The answer is
obvious: use the "cheapest" word all the time. However, by using the same
word all the time no information can be conveyed from the speaker to the
listener. Hence, Mandelbrot suggests using Shannon's measure of information

(6)  H = -\sum_{r=1}^{K} p(r) \log p(r),

where p(r) are the probabilities of occurrences of the words. Formula (6)
gives the amount of information per word in the sample of speech. This
formula, together with that for "cost" to be given below, enables Mandelbrot
to frame his law in a mathematical form.
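Formula (6) is easily evaluated, and doing so shows why the "cheapest word all the time" strategy conveys nothing. The sketch below assumes logarithms to base 2 (so that information is measured in bits); the text itself leaves the base open.

```python
import math

def information_per_word(probabilities, base=2.0):
    """Shannon information H = -sum p(r) log p(r); zero-probability
    words contribute nothing to the sum."""
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)

one_word_only = information_per_word([1.0])         # same word every time
four_equal = information_per_word([0.25] * 4)       # four equiprobable words
print(one_word_only, four_equal)
```

A vocabulary used with a single certain word yields zero information per word, whereas four equally likely words yield two bits; the optimization problem that follows trades this gain against the average "cost".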

Now, if a fixed amount of information is given per word, the question
arises which frequency distribution will give minimum average "cost" per
word. Or alternatively, given a fixed average "cost" per word, what will
be the frequency distribution of the words which gives maximum information
per word? Let C(r) be the "cost" of the r-th word. The average "cost"
per word is given by

(7)  \bar{C} = \sum_{r=1}^{K} p(r) C(r).

The problem is then to maximize H subject to the constraints:

(i)   \sum_{r=1}^{K} p(r) C(r) = \bar{C};

(ii)  \sum_{r=1}^{K} p(r) = 1.

Use of Lagrange multipliers gives the equations

(8)  \frac{\partial}{\partial p(r)} \left[ -\sum_{r} p(r) \log p(r)
         + \lambda_1 \sum_{r} p(r) C(r) + \lambda_2 \sum_{r} p(r) \right] = 0,

         r = 1, 2, ..., K,

where λ_1, λ_2 are arbitrary multipliers. The system (8), solved for each
p(r), gives

(9)  p(r) = M^{\lambda_2 - 1} M^{\lambda_1 C(r)},

where M is the base to which the logarithms are taken. The constraints
on the problem determine the values given to the arbitrary multipliers
λ_1 and λ_2. Setting

(10)  P = M^{\lambda_2 - 1},

and

(11)  B = -\lambda_1,

it is possible to write

(12)  p(r) = P M^{-B C(r)}.
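The step from the Lagrangian conditions (8) to the exponential form (12) can be written out explicitly. The following is a sketch, assuming logarithms to base M; differentiating p log_M p then contributes the constant log_M e, which equals 1 when natural logarithms are used. The symbol Z for the normalizing sum is introduced here only for illustration.

```latex
\begin{align*}
\frac{\partial}{\partial p(r)}\Bigl[-\sum_{s} p(s)\log_M p(s)
   + \lambda_1 \sum_{s} p(s)\,C(s) + \lambda_2 \sum_{s} p(s)\Bigr]
  &= -\log_M p(r) - \log_M e + \lambda_1 C(r) + \lambda_2 = 0, \\
p(r) &= M^{\lambda_2 - \log_M e}\; M^{\lambda_1 C(r)}
      = P\, M^{-B\,C(r)}, \qquad B = -\lambda_1, \\
\sum_{r=1}^{K} p(r) = 1 \;\Longrightarrow\;
P &= \frac{1}{Z}, \qquad Z = \sum_{r=1}^{K} M^{-B\,C(r)} .
\end{align*}
```

The normalization constraint thus fixes P once B is known, and B in turn is fixed by the constraint on the average "cost"; only the dependence of C(r) on rank remains to be specified.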

In order to obtain Zipf's formula from (12) it is necessary to show that
C(r) is a logarithmic function of the rank r. Mandelbrot does this by
giving the equation for the number of words, N(C), of a given cost C as:

(13)  N(C) = N(C - C_1) + N(C - C_2) + ... + N(C - C_G),

where C_1, C_2, ..., C_G are the "costs" of the individual "letters." For
large C the solution of (13) is approximately

(14)  C(r) = \log_M r,

where M is the largest root of the equation

(15)  \sum_{g=1}^{G} M^{-C_g} = 1,

and is the same M of formula (6). Substitution of (14) into (12) gives

(16)  p(r) = P M^{-B \log_M r} = P r^{-B}.

This is the proposed formula for the rank-frequency distribution. Mandelbrot
reports that this formula is not valid for small values of r, and suggests
other formulas. A more exact formula is given by

(17)  p(r) = Q (r + m)^{-B},

where Q and m are constants. When B > 1 and [a further condition,
illegible in the source, holds], an improvement of formula (16) is given by

(18)  [equation illegible in the source]

Formula (18), which according to Mandelbrot "turns out to be experimentally
excellent" (1961c, p. 195), was derived by Mandelbrot from some empirical
considerations. Although not presented here, Mandelbrot has also shown
that the type-frequency distribution may be derived from the same initial
assumptions.
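Two pieces of the exposition lend themselves to direct computation: the base M defined as the root of the letter-cost equation, sum over g of M^(-C_g) = 1, and the rank-frequency form p(r) = Q (r + m)^(-B). The sketch below is illustrative only. It finds M by bisection, assuming at least two letters with positive costs, and it fixes Q numerically so that the probabilities over a finite vocabulary of K words sum to one, which is our normalization choice rather than Mandelbrot's closed form.

```python
def letter_base(costs, tol=1e-12):
    """Solve sum(M ** -c for c in costs) == 1 for M by bisection.
    Assumes at least two letters, all with positive cost."""
    def excess(m):
        return sum(m ** -c for c in costs) - 1.0
    lo, hi = 1.0, 2.0
    while excess(hi) > 0:          # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = (lo + hi) / 2.0
        if excess(mid) > 0:        # sum still too large: root lies above mid
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def mandelbrot_probabilities(K, B, m):
    """p(r) proportional to (r + m) ** -B for r = 1..K, normalized to sum to 1."""
    weights = [(r + m) ** -B for r in range(1, K + 1)]
    total = sum(weights)
    return [w / total for w in weights]

M_equal = letter_base([1.0] * 26)       # 26 letters of unit cost
M_two = letter_base([1.0, 2.0])         # two letters of cost 1 and 2
probs = mandelbrot_probabilities(K=1000, B=1.1, m=2.0)
print(round(M_equal, 6), round(M_two, 6), round(probs[0], 4))
```

With G letters of equal unit cost the recurrence for N(C) reduces to N(C) = G N(C - 1), so M equals G; with costs 1 and 2 the root is the golden ratio. Setting m = 0 in the second function recovers the pure power law of formula (16).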

The critical assumptions in Mandelbrot's model involve the notions of
"cost" and "letter". By minimizing "cost", Mandelbrot establishes the
relationship between it and frequency and in turn, by defining that minimum
in terms of "letters" as constituents of a "word", he derives the
relationship between "cost" and rank. Thus, he is able ultimately to
establish the desired relationship between frequency and rank through the
common construct of "cost per letter." Although there is no essential
requirement that these hypothetical constructs be defined in any way other
than as specified by their mathematical definitions, i.e., schematically,
there is considerable utility in finding psychological justification for
them in an attempt to provide explanatory rather than descriptive adequacy
for the assumptions. Mandelbrot has suggested several interpretations for
his notion of cost, particular among these being the time required to read a
word. "Letter" in turn can be either phonemes or graphemes, the total cost
of a "word" becoming the sum of the constituent element costs. These
elements, demarked by a unique element, e.g., space, then define a word.

Mandelbrot's model, sophisticated and rigorous as it is, and
notwithstanding the interesting and potentially productive psychological
implications of the notion of cost, suffers from the same criticisms which
have been applied to Zipf. The empirical data still do not fit the model for
extreme values of i and the parameters of the distribution are still highly
correlated.

Yule's distribution. Using Yule's (1924) analysis of the distribution
of the frequency of species within general classifications, Simon (1955)
proposed that Yule's distribution could provide a model for the distribution
of word types within frequency classifications, i.e., for type-frequency
distributions. Again, following Rapoport's exposition which, in turn,
closely parallels that of Simon (1955, 1957), the model can be presented
as follows:

Consider a text that has reached a length of N words. First assume
that the probability that the (N+1)-st word is a word that has already
appeared exactly i times is proportional to i n_i, that is, to the total
number of occurrences of all words that have appeared exactly i times.
Then assume that there is a constant probability, α, that the (N+1)-st
word is a new word, a word that has not occurred in the first N words. Given
the above assumptions, Simon derives the following distribution function
for the number of words used exactly i times:

(19)  n_i = n_1 \, \frac{2 - \alpha}{1 - \alpha} \,
            B\!\left(i, \frac{1}{1 - \alpha} + 1\right),

where n_1 is the number of words which appear only once in the sample, B is
the Beta function and α is a free parameter assumed to be constant for
different sample sizes.

It is shown (Simon, 1957, p. 151) that

(20)  n_1 = \frac{K}{2 - \alpha},

where K is the total number of types. Simon shows that when α is small,
equation (19) simplifies to

(21)  n_i = \frac{K}{i(i + 1)}.

For large i, an approximation for (21) is

(22)  n_i \approx \frac{K}{i^2},

which is equivalent to the Zipf relation (equation 5).

The first step in fitting the distribution expressed in equation (19)
to word-count data is to get an estimate for α, the assumed constant
probability that a new word will be added to the text. It would be possible,
of course, to count n_1 and to solve equation (20) for α; all subsequent
terms may then be obtained by applying a recursive formula to the Beta
function:

(23)  n_i = \frac{(1 - \alpha)(i - 1)}{1 + (1 - \alpha) i} \, n_{i-1}.

There are three characteristics which are generally observed in
type-frequency distributions, and which should be accounted for by any model
for type-frequency distributions. First, it is observed that type-frequency
distributions are J-shaped distributions with very long tails.
The tails can generally be fitted by the function

(24)  n_i = \frac{a \, b^i}{i^m},

where a, b, and m are constants. Simon (1955) shows that (19) fulfills
this requirement. Another characteristic of observed type-frequency
distributions is that the parameter b in (24) is generally very close to 1,
and m is very close to 2. In this case, (24) becomes

(25)  n_i = \frac{a}{i^2},

which is the same as (22) when a, the constant, is replaced by K. A
third characteristic is that the following relations:

(26)  \frac{n_1}{K} = \frac{1}{2},

(27)  \frac{n_2}{K} = \frac{1}{6},

(28)  \frac{n_2}{n_1} = \frac{1}{3},

seem to hold approximately true for observed type-frequency distributions.
The same relations are easily obtained from (21):
n_1 = K/(1(1+1)) = K/2, n_2 = K/(2(2+1)) = K/6, and
n_2/n_1 = (K/6)/(K/2) = 1/3. The function suggested by Simon thus has all
the above desired characteristics.
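The fitting procedure just outlined can be sketched numerically. The code below assumes the forms n_1 = K / (2 - α) and the recursion n_i = n_{i-1} (1 - α)(i - 1) / (1 + (1 - α) i), and checks them against a direct evaluation through the Beta function with second argument 1/(1 - α) + 1. The function names and the illustrative values of K and α are ours.

```python
import math

def beta(a, b):
    """Beta function evaluated through log-gamma for numerical stability."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

def simon_counts(K, alpha, max_i):
    """Expected n_1..n_max_i from n_1 = K/(2 - alpha) and the recursion."""
    n = [K / (2.0 - alpha)]
    for i in range(2, max_i + 1):
        n.append(n[-1] * (1.0 - alpha) * (i - 1) / (1.0 + (1.0 - alpha) * i))
    return n

def simon_counts_beta(K, alpha, max_i):
    """The same quantities computed directly from the Beta function."""
    rho = 1.0 / (1.0 - alpha)
    n1 = K / (2.0 - alpha)
    return [n1 * (rho + 1.0) * beta(i, rho + 1.0) for i in range(1, max_i + 1)]

limit_case = simon_counts(K=600, alpha=0.0, max_i=3)      # alpha -> 0 limit
by_recursion = simon_counts(K=600, alpha=0.1, max_i=20)
by_beta = simon_counts_beta(K=600, alpha=0.1, max_i=20)
print([round(x, 3) for x in limit_case])
```

In the limit of vanishing α the recursion reproduces n_i = K / (i (i + 1)), and hence the ratios one half, one sixth and one third quoted above; for larger α the two routes agree term by term, which offers a check on the recursive formula.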



The model assumes a Markovian generator for which the probability
of any particular word is dependent upon the probabilities of the preceding
words. A generator whose states are determined by the preceding states
has high face validity as a model of a speaker. Unfortunately, however,
Simon must also assume that the state sequences are ergodic. That is to
say, that the probability dependencies between states do not change over
time. Simon, in fact, has confessed that "It is known empirically, at
least for the most straightforward application of the model, that α(K), the
rate at which new words appear in text, is, in fact, not a constant but a
slowly decreasing function of K" (Simon and Van Wormer, 1963, p. 204).
Miron and Wolfe (1964) have suggested a mechanism by which such dependencies
might change over time for the lognormal model to be presented later in
this paper, but it is difficult to fit such a mechanism into Yule's model.



Simon argues that psychological justification for his model can be
established by assuming that language involves the processes of association
and imitation. He claims that the speaker's choice of messages is
determined by imitation of the messages used by other speakers and that the
sequential dependencies of these messages are determined by the associations
established by linguistic experience. Both processes might indifferently
be applied to any stochastic model of the speaker. Both processes
undoubtedly do have some influence upon what might be called speech habits,
but as a general model of the speaker all such stochastic models have been
shown to represent inadequately the novelty and innovation which make up
the greater part of the speech process. Even were we to set aside these
theoretical arguments, however, there would remain the fact that the
hypothesized distribution represents a rather bad fit to the
empirical facts. Herdan (1962) concludes that "The discrepancy between
[Simon's] theory and observation is such as to invalidate Simon's claim
that his model fitted word-frequency distribution in the whole range of
the variable."



The lognormal distribution. Because of its particular pertinence to
the theoretical justification upon which our research rests, the following
earlier study by the author in collaboration with Ms. Sharon Wolfe (Miron
and Wolfe, 1964) is presented here in its entirety. It provides evidence
of the generality of the lognormality of the vocabulary derived from word
associations elicited as responses to stimuli embedded in linguistic frames
and hence of the validity of such a procedure for ascertaining vocabulary
distributions as a substitute for those procedures normally employing
continuous text.



Among the several alternative theoretical laws of word-frequency
distributions which have been advanced, the most recent has been the
suggestion that such distributions conform to the class of skew normal
distributions. Herdan (1961), notably, has brought together a series of
studies the import of which is to establish that the frequency distributions
of the number of words sharing the same frequency of occurrence are
lognormal. Howes and Geschwind (1963) and Rapoport (1965) have used lognormal
transformations of word frequencies with success in characterizing the ad
libitum speech of aphasic patients and normals. And most recently, Carroll
(1971) has used the lognormal distribution to characterize the extensive
vocabulary samples compiled in the production of the American-Heritage
Word-Frequency Dictionary. The present paper represents an attempt to
investigate the applicability of the lognormal distribution to
word-association responses in a variety of languages when those responses
are restricted to qualifiers.



Kapteyn has shown that a random positive variate, the change in which
is determined by a random proportion of the momentary value of the variate,
will be lognormally distributed provided the assumptions necessary for the
central limit theorem are met.
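
A small simulation can make this generating mechanism concrete. The sketch below is an illustration only, not the procedure of the study: the step count, the uniform range of the proportional change, and all function names are assumptions chosen for the example. Each variate starts at 1 and is repeatedly changed by a random proportion of its current value; the raw values come out strongly skewed, while their logarithms are close to symmetric, as a lognormal variate requires.

```python
import math
import random


def simulate_proportionate_effect(n_paths=2000, n_steps=200,
                                  max_change=0.1, seed=0):
    """Final values of variates whose change is a random proportion of
    their momentary value (all parameter values are illustrative)."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_paths):
        x = 1.0
        for _ in range(n_steps):
            # change = random proportion * current value
            x += x * rng.uniform(-max_change, max_change)
        finals.append(x)
    return finals


def skewness(values):
    """Moment coefficient of skewness (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


finals = simulate_proportionate_effect()
raw_skew = skewness(finals)
log_skew = skewness([math.log(v) for v in finals])

print(f"skewness of raw values: {raw_skew:.2f}")
print(f"skewness of log values: {log_skew:.2f}")
```

The logarithm of each final value is a sum of many small independent terms, which is where the central limit theorem enters.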



The probability of responses in a standard word-association task, from
the viewpoint of the habit-hierarchy position, can be considered a positive
variate which is the outcome of a discrete random process. Assuming, in
addition, the existence of some factor or factors operating to produce
momentary changes in this distribution, it is reasonable to expect that the
resultant overall probability distribution should be lognormal in form. In
principle, the change effect was originally postulated by Hull as influencing
the momentary response probability through the operation of an oscillatory
mechanism. It therefore appeared appropriate to inquire whether the
probability distribution of word-association responses could be shown to
conform to the hypothesized distribution, with a view toward identifying
the response analogues of the necessary conditions for the genesis of that
distribution, if it were appropriate. In addition, it was felt that the
parameters of such a distribution might reflect certain aspects of the
linguistic habits of Ss from different speech communities.

METHOD 

Subjects. One hundred males of high-school age in each of 12
linguistic communities were used in the study. All Ss were nominally
monolingual speakers of the mother tongue of the community of which they
were resident. Their ages ranged between 13 and 17. The 12 languages
and places of origin of the data comprising the sample were as follows:
Afghan-Farsi (Kabul, Afghanistan); American-English (Decatur, Illinois);
Arabic (Beirut, Lebanon); Cantonese (Hong Kong); Dutch (Amsterdam, Holland);
Finnish (Helsinki, Finland); Flemish (Brussels, Belgium); French (Paris,
France); Iranian-Farsi (Tehran, Iran); Japanese (Tokyo, Japan); Kannada
(Mysore, India); and Swedish (Uppsala, Sweden).

Testing Instrument. A standard testing procedure for eliciting
qualifier associations to each of 100 stimuli was devised and appropriately
modified for testing in each of the sample languages. The Ss were told
to place each of the 100 substantive stimuli in a common frame sentence
and to complete the frame by supplying a single qualifier which in their
judgement would appropriately fit the frame. For the English-speaking Ss
the frames were: "The ______ BUTTERFLY." and "The BUTTERFLY is
______." Both frames define Fries (1952) words of Class 3. The
particular test frames or frame varied from language to language as the
syntactic requirements of qualifier distribution varied.

The 100 substantives were selected from a pretested pool of 200 items.
The original 200 items in turn were drawn in part from a list of items used
in glottochronological investigations purported to be of wide linguistic
applicability, from the Kent-Rosanoff list, and from category headings used
by the Human Relations Area Files index.

Testing Procedure. Testing was carried out in the classrooms of the
schools, Ss being run in large groups comprising the normal class.
Complete testing for the 100-stimuli final list required approximately one-half
hour. The Ss were instructed to attempt to supply an associate to all items
but to omit any items with which they experienced inordinate difficulties.
The 100-item final stimulus list was derived from the tests administered in
Finnish and English for the 200-item lists, the procedures being identical
to those used in later tests except for the increased length of task in
these two countries.

Scoring Criteria. All associated responses were inspected by native
speakers of each of the languages for non-conformity to instructions.
Responses judged not to be admissible within the test frames were discarded.
Such discarding of responses, however, was kept to an absolutely small
limit, doubtful instances being retained. All grammatical inflections and
orthographic variants were regularized to a single consistent form, with
only differences in root forms being considered instances of separately
distinct responses. Combinations of free morphemes, multiple-word or phrase
responses and neologisms were accepted. In all instances the assumption
was made that S's response was acceptable unless the response in question
was clearly deviant from minimal usage standards.

Method of Analysis. Each qualifier type has an associated frequency
of occurrence representing the total number of occurrences of the type
across all Ss over all stimuli. The types can be classified by occurrence
frequency: category i contains all $n_i$ types which share occurrence
frequency $f_i$. It was hypothesized that the distribution of the random
variable $F$ which takes on values $f_i$ is lognormal; i.e., if the variable
$X = \log F$ is introduced, the distribution of $X$ is normal. This is
expressed in the equation $P(X \le \log f_i) = \Phi(z_{\log f_i})$, where
$\Phi$ is the standard normal cumulative distribution, $f_i$ is a particular
occurrence frequency, and $z_{\log f_i} = (\log f_i - \mu)/\sigma$.
Thus, the probability of obtaining a category with frequency of occurrence
500, for instance, is given by $\Phi(z_{\log 500}) - \Phi(z_{\log 499})$.
Another way of interpreting this statement is to say that the probability
associated with occurrence frequency 500 is simply the proportion of types
which are expected to have occurrence frequency 500. The empirical estimate
of this probability, in turn, is simply the number of types actually
occurring with frequency 500 divided by the total number of types.
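
The classification of types into occurrence-frequency categories can be sketched as follows. This is a minimal illustration with made-up responses; the function name and the toy data are hypothetical and are not taken from the study.

```python
from collections import Counter


def frequency_spectrum(responses):
    """Map each occurrence frequency f_i to n_i, the number of types
    that occur exactly f_i times in the pooled responses."""
    type_frequency = Counter(responses)           # type -> occurrences
    spectrum = Counter(type_frequency.values())   # f_i -> n_i
    return dict(sorted(spectrum.items()))


# Toy pooled responses across all Ss and stimuli (hypothetical data).
responses = ["big", "big", "big", "small", "small",
             "red", "blue", "old", "old"]

spectrum = frequency_spectrum(responses)
total_types = sum(spectrum.values())

# Empirical probability of each category: n_i / total number of types.
proportions = {f: n / total_types for f, n in spectrum.items()}

print(spectrum)
print(proportions)
```

Here two types occur once, two occur twice, and one occurs three times, so the category proportions are taken over five types in all.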

These empirical estimates of the probabilities were used to obtain
least-squares estimates for $\mu$ and $\sigma$, the parameters of the normal
distribution. For category i, $\sum_{j \le i} n_j / \sum_j n_j$ is an estimate
of the cumulative probability, which can then be transformed with the aid of
standard normal tables to a z-score. This transformation in turn yields a set
of empirical z-scores, where $z_i = (\log f_i - \mu)/\sigma$. This equation is
linear in both of the variables, $Z$ and $X = \log F$; accordingly the
least-squares solutions to the general linear equation $Z = mX + k$ were
obtained such that $m = 1/\sigma$ and $k = -\mu/\sigma$. Once the least-squares
solutions $\hat{\mu}$ and $\hat{\sigma}$ were obtained, predicted z-scores,
$\hat{Z} = (X - \hat{\mu})/\hat{\sigma}$, were calculated, and the predicted
cumulative probabilities $\Phi(\hat{Z})$ were obtained from normal tables.
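
The fitting procedure just described can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the original computation: logarithms are taken to base 10, the standard library's `statistics.NormalDist` stands in for normal tables, and the top category (whose cumulative proportion is 1 and therefore has no finite z-score) is simply dropped, a choice the text does not specify. The synthetic spectrum and the "true" parameter values are hypothetical.

```python
import math
from statistics import NormalDist


def fit_lognormal(spectrum):
    """Least-squares estimates (mu, sigma) from a spectrum {f_i: n_i}.

    Regresses empirical z-scores on X = log10(f_i); then
    sigma = 1/m and mu = -k/m for the fitted line Z = m*X + k.
    """
    std_normal = NormalDist()
    total = sum(spectrum.values())
    xs, zs = [], []
    cumulative = 0
    for f in sorted(spectrum):
        cumulative += spectrum[f]
        p = cumulative / total
        if p >= 1.0:
            continue  # assumption: skip the point with no finite z-score
        xs.append(math.log10(f))
        zs.append(std_normal.inv_cdf(p))
    mean_x = sum(xs) / len(xs)
    mean_z = sum(zs) / len(zs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
    m = sxz / sxx
    k = mean_z - m * mean_x
    sigma = 1.0 / m
    mu = -k * sigma
    return mu, sigma


# Synthetic spectrum built from known parameters (hypothetical values).
true_mu, true_sigma, n_types = 0.5, 0.4, 1_000_000
target = NormalDist(true_mu, true_sigma)
spectrum, previous = {}, 0
for f in range(1, 51):
    cum = round(n_types * target.cdf(math.log10(f)))
    if cum > previous:
        spectrum[f] = cum - previous
    previous = cum
spectrum[1000] = n_types - previous  # remaining types in one top category

mu, sigma = fit_lognormal(spectrum)
print(round(mu, 3), round(sigma, 3))
```

On this synthetic input the fit recovers values close to the parameters used to build it; with real data the end-effect distortions discussed under RESULTS would affect the extreme categories.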

It can be shown that if the distribution of occurrence frequencies for
types is lognormal, the first moment of this distribution defines the
distribution of occurrence frequencies for tokens. The total number of tokens
in category i is simply the product of $f_i$, the number of tokens for each
type, and $n_i$, the number of types sharing this occurrence frequency. Here
again, $P(X \le \log f_i) = \Phi(z_{\log f_i})$. However, the estimate of
$\Phi$ here is given by $\sum_{j \le i} n_j f_j / \sum_j n_j f_j$, the total
number of tokens occurring in categories of occurrence frequency $f_i$ or
less, divided by the total number of tokens, $\sum_j n_j f_j$.
The least-squares estimates for $\mu$ and $\sigma$ for the token distributions
were calculated by the same method as for types.
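
For tokens only the cumulative proportions change. A minimal sketch, using a toy spectrum and a hypothetical function name:

```python
def token_cumulative_proportions(spectrum):
    """Cumulative share of tokens in categories of frequency f_i or less.

    spectrum maps f_i -> n_i; category i contributes n_i * f_i tokens.
    """
    total_tokens = sum(n * f for f, n in spectrum.items())
    running = 0
    result = []
    for f in sorted(spectrum):
        running += spectrum[f] * f
        result.append((f, running / total_tokens))
    return result


# Toy spectrum (hypothetical): 2 types occur once, 2 twice, 1 three times.
spectrum = {1: 2, 2: 2, 3: 1}
token_cdf = token_cumulative_proportions(spectrum)
print(token_cdf)
```

The nine tokens here split 2, 4 and 3 across the three categories, so the cumulative shares are 2/9, 6/9 and 1.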

As a further test of the lognormality of the distributions for types
and for tokens, it can be shown (see Herdan, 1961) that the jth moment of
a lognormal variate with parameters $\mu$ and $\sigma$ is also a lognormal
variate with parameters $\mu_j = \mu + j\sigma^2$ and $\sigma$, where
logarithms are taken with respect to the base e. For this study, with
$X = \log_{10} f$, the mean of the jth moment becomes
$\mu_j = \mu + j \log_e 10 \cdot \sigma^2$. Thus the variance for tokens
should equal the variance for the types distribution, and the means should
show the relationship expressed above; since $j = 1$ in this case, the
equation becomes

$\mu_{tokens} = \mu_{types} + \log_e 10 \cdot \sigma^2_{types}$
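
The factor $\log_e 10$ follows from a change of base. The derivation below is a sketch; the symbols $m$ and $s$ (the parameters on the natural-log scale) are introduced here for illustration only and do not appear in the original.

```latex
% Assumed notation: Y = \ln F \sim N(m, s^2) for types.
% Weighting each type by its frequency e^{y} gives the token density.
\begin{align*}
p_{\text{tokens}}(y) &\propto e^{y}\exp\!\left[-\frac{(y-m)^2}{2s^2}\right]
   \propto \exp\!\left[-\frac{\bigl(y-(m+s^2)\bigr)^2}{2s^2}\right],\\
X = \log_{10} F = \frac{Y}{\log_e 10}
   &\;\Longrightarrow\; \mu = \frac{m}{\log_e 10},\qquad
   \sigma = \frac{s}{\log_e 10},\\
\mu_{\text{tokens}} &= \frac{m+s^2}{\log_e 10}
   = \mu + \frac{(\sigma\log_e 10)^2}{\log_e 10}
   = \mu + \log_e 10\cdot\sigma^2 .
\end{align*}
```

The variance is unchanged by the weighting, which is why the type and token lines are expected to be parallel.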

RESULTS 

Figure 1 displays the primary results of the foregoing analysis for
each of the language samples. If lognormality holds and if the $f_i$ are
plotted against cumulative proportions on lognormal paper, a straight-line
graph should result. Although none of the extant significance tests commonly
employed for estimating goodness of fit is entirely appropriate for functions
of this kind, inspection of the figures clearly indicates sensible linearity
for a major proportion of the transformed empirical points. Correlation
coefficients computed between predicted and obtained z-scores ranged
between 0.900 and 0.999 for the type distributions and between 0.900 and 0.998
for the token distributions. Since the squared correlation coefficient gives
the proportion of variance of the obtained distribution accounted for by the
predicted distribution, the fit would seem to be remarkably good. It is
still entirely possible, however, to find that the remaining variance,
despite its small size, is large relative to the error and hence significant.
At least three considerations, however, militate against attempting to test
the significance of the departure of the data from the hypothesized
distributions. First, the values of the occurrence frequency variable cannot
be considered independent. Second, there are pronounced end-effect
distortions in these distributions due to the finite size of the sample and to
the finite and variable step increments of the occurrence frequencies. Also,
the precision of estimates of the probabilities for each occurrence frequency
is necessarily greater for categories containing large numbers of responses
than for those categories containing few responses; accordingly the various
occurrence-frequency categories should not be given equal weight in estimating
departures from lognormality. The third argument involves the logic of
significance testing. Given the impressively large proportion of variance
accounted for by the hypothesized distribution, it does not seem reasonable
to test the unexplained variance. For practical purposes, the best estimate
of the position of an undetermined point would be that provided by the
parameters of the fitted curves, the least-squares estimates of which are
displayed in Table 1.



It will be observed that the language distributions displaying greatest
differences in slopes (variances) are those for Afghan Farsi and French. Of
all the languages, Afghan Farsi exhibited the flattest slope (highest
variance) for both type and token distributions, indicating for this group of
respondents that responses tended to be evenly distributed across occurrence
categories. The French data, on the other hand, tended to have greater
variation in numbers of responses in the different frequency-of-occurrence
categories; i.e., the French distributions displayed lowest variance and
steepest slope. Inspection of the $\mu$ estimates indicates that the difference
in variances of the distributions for these two languages is apparently
attributable, in part, to the greater number of single-occurrence responses
in the French data.

The preferential emission of either low- or high-frequency qualifiers
undoubtedly has its basis in either or both linguistic and cultural
characteristics of the speech community. Samples with high mean occurrence
frequencies reflect a preferential usage of qualifiers elicited from most of
the other respondents and stimuli, while samples with low means reflect
predominant usage of qualifiers idiosyncratic to the individual or to
particular stimuli. Given equal variances, therefore, those samples with
highest mean occurrence frequencies can be characterized as exhibiting
greater stereotypy of response than those with lower means.

Since stereotypy seems to be a function of both parameters, a working
definition for this concept is given by the ratio $\mu$ over $\sigma$ as in
Table 1. These ratios indicate that the type distributions for Afghan Farsi,
English, Cantonese, and Dutch respondents have lowest stereotypy in that
order, and the Japanese, Kannada, Arabic, and French respondents displayed
greatest stereotypy in that order. For the token distributions the
respondents with least stereotypy in decreasing order were: Afghan Farsi,
Cantonese, Dutch, and Iranian Farsi. Those with greatest stereotypy in
decreasing order were: Japanese, Kannada, Arabic, and Finnish.

[Fig. 2 (A and B). Co-variation of the estimated lognormal parameters of the
qualifier type and token distributions. Note: Language families in caps.]

Hartley's test of the differences in variances and an analysis of
variance applied to the differences in means between language samples
indicated that both parameters of the individual lognormal distributions
differentiated among the languages when considered in the aggregate. The
joint parameter variations for the type and token distributions are
displayed in Figures 2 and 3, respectively.

Since the token distribution should represent the first moment of the
distribution for types, the two lines for a given language should be parallel
(i.e., have equal variances), and separated by a distance of
$\log_e 10 \cdot \sigma^2$. The differences between the $\mu$ value for tokens
and that computed from the relationship of the moments of the lognormal
distribution specified by the equality
$\mu_1 = \mu + \log_e 10 \cdot \sigma^2$ ranged between 0.76 and 0.05. The
value for $\sigma^2$ which was used in these computations was obtained by
averaging the variances of the type and token distributions for each
language. Due to the absolutely small magnitude of the standard error, all of
these differences are significant beyond the 1% level, both for the
individual languages as well as for all languages in aggregate, when tested
by t. Thus, although the magnitude of the separation between the means
closely approximates that predicted by theory, significance tests indicate
that these departures from predicted values are significant.



Previous successful applications of the lognormal transformation to
word-frequency distributions have been made on nouns, function words, and
all word occurrences in running texts (see Herdan, 1961). Apparently we
may now add qualifiers as obtained from a restricted word-association
procedure to the growing list of such successful applications. In view
of the comparability of occurrence-frequency estimates obtained in
word-association procedures and other methods of obtaining word counts, it
would have been surprising to find lack of fit in this study where other
investigators have demonstrated lognormality of the frequency of various word
classes obtained from running texts.

No compelling linguistic explanation for the obtained ordering of the
groups in terms of their joint parameter variations appears to exist.
Although three of the low-stereotypy languages are of Indo-European origin,
the presence of a Sino-Tibetan language (Cantonese) in this group makes it
unlikely that language-family differences alone will suffice. On the other
hand, there is a clear orderly arrangement of the joint parameter variations
for the languages of the Indo-European family. The values of $\mu$ and
$\sigma$ for the Indo-European languages are reasonably well fitted by a
linear expression. In order to explain the variations within language family,
however, it is necessary to invoke some additional explanation. It is
possible that these variations might be explained by the nature of
student-teacher interactions. Variation in the formality of secondary school
education would be expected to influence the amount of innovation the
students perceive as permitted. It is of interest to note in this regard
that the Japanese sample, for which a high degree of formality in secondary
education is a well-attested fact, exhibits the highest stereotypy values.



The stereotypy groupings presumably must involve a difference in the
relative habit strengths of the responses to the stimulus aggregate. For
the low-stereotypy group, this would indicate that more nearly equal habit
strengths are present in the various S-R habit hierarchies than exist in the
high-stereotypy subject-stimulus group as a whole. The high-stereotypy
group would be characterized by one or more highly dominant responses, with
the remaining responses of the divergent set being of disparately low
probability. Such differences in the structure of S-R hierarchies can be
attributed to differences in the predominance of certain linguistic
conventions. If this is the case, we should expect to find that the
high-stereotypy languages exhibit more conventionalized conformity in
sequences of modifier-substantive usage in the language as a whole.



If the response-frequency variate of this study can be considered an
index of the probabilities of responses in the response hierarchy of the
aggregate subject-stimuli combinations, and if these probabilities are
subject to random change, it should be possible to specify the nature of the
change operation in such a way as to derive the lognormality of the variate.
Let us hypothesize that the stimulus aggregate produces an ordered set of
responses varying in probability of emission. Let us further specify that
from moment to moment these probabilities are subject to change, that change
being attributable to an hypothesized tendency on the part of S to condition
his choice of responses on the basis of the responses given to the previous
items on the list.

Thus, although a given response may originally reside in a category
of high probability of occurrence to the aggregate stimulus set, its
pre-occurrence to one of the items of the stimulus set places it in a variably
lower probability category. Subsequent repeated usage of the same response,
in turn, places the response in increasingly lower probability categories,
the extent of the probability decrement being variable across both individuals
and responses. The proportion of reduction in probability of the re-emission
of a specific response in a task of this kind is considered to be a constant
fixed for the individual by the linguistic habits of his community, the size
of his vocabulary, and his individual sensitivity to repetition within the
limits established by the community. Although the proportional probability
decrements for responses are constant for any given individual, their
distribution across Ss is assumed to be randomly proportional to the
probability of the last value of the occurrence-frequency variate.
Accordingly, the parameter variations of this study provide an index of the
average value of these individual-subject variations for the languages
investigated.

SUMMARY 

One hundred Ss in each of 12 widely divergent linguistic communities
were administered a standardized restricted word-association test consisting
of 100 substantive stimuli. The Ss were instructed to provide a single
response which conformed to the requirements of substitutability in a test
frame designed to restrict responses to qualifiers only. The total
frequency of all unique responses, excluding grammatically inflected responses,
was tabulated. Categories of equal frequency of occurrence were determined
and the distribution of the number of responses sharing the same frequency
of occurrence was plotted. It was hypothesized that these distributions
should substantially conform to a theoretical distribution of the lognormal
form, since many aspects of the word-association task have high similarity
to the generative rules of the lognormal distribution.

The obtained distributions were found to conform sensibly to the
hypothesized distribution. An analysis of the variance explained by the
lognormal equations of best fit to the transformed points indicated that very
little variance remained unaccounted for by the hypothesized distribution.
Accordingly, variations of the estimated parameters were examined for clues
as to the nature of the processes these parameters might reflect. The concept
of stereotypy of response was introduced and defined as the degree of
response uniformity across both Ss and stimuli.

More generally, it was suggested that this stereotypy could be expected
to be the result of stable linguistic conventions. The individual's
responsiveness to these conventions was assumed to be a function of his
sensitivity to response repetition within the limits established by the
speech community.

The American-Heritage Project. The most ambitious test of the
lognormality of word-frequency distributions is that recently completed with
the publication of the American-Heritage Word Frequency Book (1971). Under the
direction of John Carroll, more than 5 million total words were sampled
from approximately 1000 graded school texts and reading sources. This
corpus, called the American-Heritage Intermediate Corpus (AHI), formed the
source material for a successful test of the lognormal model. The entire
project has resulted in what must be adjudged one of the most useful and
elaborate of statistical analyses ever completed. Carroll states: "As
one inspects the data assembled here, many questions come to mind: How
representative of the total lexicon of English are the word types that are
listed? How accurate and reliable are the frequency data? How do the
vocabularies for the various grade levels and subject matters differ?
What is the effect of the word-unit chosen to be the basis of the
frequency counts?

To some of these questions it is now possible to give answers that
are probably correct within fairly narrow limits. Many of these answers
can be derived through the analysis of the Corpus on the basis of a
powerful statistical model of vocabulary that can be shown to account for the
data in a surprisingly precise way. This model, which apparently was
first developed by G. Herdan, is called the lognormal model, because it
postulates that the total vocabulary underlying a corpus is distributed
according to the familiar 'normal distribution' when the logarithms of
the frequencies are used."

Having accepted the lognormality of the AHI corpus, Carroll is able
to predict the probabilities of the word token and type occurrences in
the assumed total population of the English language. In addition, one
can determine the expected number of word tokens which will be accounted
for by any given number of word types and the relative frequency of
occurrence of each. All of this is, of course, dependent upon the assumption
that the lognormal model is, in fact, an adequate schemapiric
representation of the data and that the data itself is an adequate
representation of the English language. The first assumption has received
more empirical support than the alternative models which have been proposed
for word-frequency distributions. As this paper has tried to indicate,
however, such support can never constitute a proof. A schemapiric model of
any domain can only be judged in terms of the theoretical desiderata of
parsimony, productiveness, explanatory adequacy and utility. What Carroll,
Miron and Wolfe and others have achieved is a demonstration of the descriptive
adequacy of such a model as an account of their data observations. The
results of the remaining tests of that model's adequacy still must be
considered to be only tentatively suggestive.

The second assumption clearly has some limitations in the Carroll
test, as indeed it does in all of the other tests, although in differing
ways. The AHI corpus is based on a sampling of texts typically employed
by third through ninth graders. This corpus produces a theoretical
expectation that the English language contains a total of 606,906 word types
as estimated from the empirical occurrences of the 86,741 types actually
appearing in the AHI sample. If this population estimate is to have any
use, it necessarily implies that (1) new words which enter the language
must displace old words, (2) that the population growth rate of English
is fixed; i.e., that the birth rate exactly equals the death rate of words,
and (3) that a lexicon of this size will exactly specify the "character" of
English within any specified and arbitrarily small value of precision
approaching zero as a limit. At the least, one would want to hedge one's
faith in these implications by the caveats that it is (1) the child's
English which is being addressed, (2) that "type" has been defined by
orthographic pattern (e.g., word, words, wording, Word are distinct type
entries in the AHI) and that (3) the writers of books for schools
undoubtedly have already assumed a limitation on the vocabularies of their
readers. Nonetheless, the AHI analysis represents the closest approximation
yet achieved to a precise specification of the vocabulary characteristics of
English, the caveats notwithstanding.

Considering the expected users of the AHI data analyses, the
exposition of the procedures is extraordinarily complex. But, if the exposition
is difficult to follow, the editors can find excuse in the difficulty of
their task. Consider the following expository table (American-Heritage
Word Frequency Book, Table D-2, page 3) illustrating the calculations of
F, the frequency of occurrence; D, the diversity or dispersion of occurrence;
U, the estimated frequency of occurrence per million tokens adjusted for
diversity of occurrence; and SFI, the standard frequency index of type
occurrence per token based upon these prior calculations.

SFI corrects the occurrence probabilities for dispersion across the
differing content samples which make up the AHI corpus. This correction
employs the information theory measure of uncertainty and weights more
heavily those word types which are more nearly equal in frequency across
differing content samples. SFI, in turn, is related to occurrence
probability by the relation:

(28)    $\mathrm{SFI} = 10\,((\log P) + 10)$
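
Relation (28) and its inverse can be written as two small functions. This is a minimal sketch: it assumes the logarithm in (28) is the common (base-10) logarithm, which is consistent with the antilog calculations elsewhere in this section, and the function names are hypothetical.

```python
import math


def sfi_from_probability(p):
    """Standard frequency index from an occurrence probability P,
    following relation (28) with an assumed base-10 logarithm."""
    return 10.0 * (math.log10(p) + 10.0)


def probability_from_sfi(sfi):
    """Inverse of relation (28): occurrence probability from SFI."""
    return 10.0 ** (sfi / 10.0 - 10.0)


# A type expected once per million tokens (P = 0.000001).
sfi_once_per_million = sfi_from_probability(1e-6)

# Round trip for the token probability used later in this section.
p_back = probability_from_sfi(sfi_from_probability(0.04917))

print(round(sfi_once_per_million, 6), round(p_back, 5))
```

Under this reading each step of 10 in SFI corresponds to a tenfold change in occurrence probability.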

For our purposes, it is the lognormality of the probabilities of
the token and type occurrences which will be considered. Carroll's Figure
B-3 (American-Heritage Word Frequency Book, page xxv) graphically represents
the theoretical cumulative word-frequency distributions which best model
his data. The cumulative proportions of the total type and token distributions
are plotted on the ordinate as normal deviates and the log probability of
occurrence along the abscissa.

It is from this model that one can assess the relationship of number
of occurrence tokens to number of occurrence types. Assume that we wish to
estimate the theoretical number of types which would be required to account
for 50% of all of the token word occurrences in English. Entering the
ordinate of Figure B-3 at the value of zero, which corresponds indifferently
to the mean, median and mode of the symmetrical normal distribution, we find
that this cumulative proportion corresponds to a theoretical token probability
of .04917 (antilog of $\bar{2}.6917$, i.e., of -1.3083, = .04917). That is to
say, we should expect to find 50% of all words of English occurring with
frequencies of 4 or greater per 100 words.

If we now enter the abscissa at a value which corresponds to this
probability and find the corresponding point on the ordinate which
intersects the type distribution at that abscissa value, we can ascertain the
cumulative proportion of types which would have probabilities of 4 or
greater occurrences per 100 words. Or stated otherwise, the ordinate
value of the type distribution corresponding to the 50% point of the
cumulative token distribution represents the number of types theoretically
represented in 50% of all token occurrences. The cumulative normal value
in this instance is approximately 3.7 as estimated from the graph. A
normal deviate of this magnitude means that approximately .0005 times
100 percent of all types have occurrence probabilities equal to or greater
than the value of .04917. Assuming the theoretical total number of types
in English to be 606,906, we determine that approximately 300 word types
should theoretically account for 50% of all word occurrences (.0005 times
606,906).



If one were to attempt to account for 95% of all word occurrences, the
same procedure would result in an estimate that some 10,000 word types
would be required. And for 99%, 44,000 types would be required. In each
instance, it is the highest-frequency types which must be selected, and the
best estimate of which particular types these may be is derived from their
ranks in the sample as determined by their respective CFI values.
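
The arithmetic of the procedure just described can be sketched in a few
lines of Python. This is an illustrative sketch only, not Carroll's own
computation. It assumes a standard normal curve, the total of 606,906 types
used above, a type-distribution deviate of about 3.3 (the graph reading
consistent with the stated proportion of .0005), and a reading of -2.6917 as
a table-style logarithm with characteristic -2 and mantissa .6917. All
function names are hypothetical.

```python
from statistics import NormalDist

# Theoretical total number of English word types assumed in the text.
TOTAL_TYPES = 606_906

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def antilog_from_table(characteristic: int, mantissa: float) -> float:
    """Antilog of a table-style logarithm (characteristic plus positive mantissa)."""
    return 10.0 ** (characteristic + mantissa)


def types_at_or_above(type_deviate: float, total_types: int = TOTAL_TYPES) -> float:
    """Expected number of types lying above a normal deviate on the type curve."""
    upper_tail = 1.0 - STANDARD_NORMAL.cdf(type_deviate)
    return upper_tail * total_types


def deviate_for_type_count(n_types: float, total_types: int = TOTAL_TYPES) -> float:
    """Inverse reading: the deviate that leaves n_types in the upper tail."""
    return STANDARD_NORMAL.inv_cdf(1.0 - n_types / total_types)


if __name__ == "__main__":
    # Median token probability, under the assumed reading of -2.6917 as -2 + .6917.
    print(round(antilog_from_table(-2, 0.6917), 5))
    # Types needed for 50% of tokens, using the assumed graph reading of 3.3.
    print(round(types_at_or_above(3.3)))
    # Deviates implied by the text's round figures of 300 and 10,000 types.
    print(round(deviate_for_type_count(300), 2))
    print(round(deviate_for_type_count(10_000), 2))
```

With a deviate of 3.3 the sketch yields a figure a little under 300 types,
in line with the rounded estimate in the text. A reading taken by eye from
a graph cannot support more precision than that.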

Concluding Observations 

As we have indicated, all of the foregoing is as weak or as strong
as are the assumptions upon which the models rest. It is our belief, and
the substance of our recommendation, that the elicitation procedures which
we have outlined and which form the basis of our research have strong
justification for their assumptions. Further, it would seem that those
procedures largely obviate the conceptual abstractions of the data which
Carroll, for example, is required to make in order to satisfy the
assumptions of the model he employs. We require only the assumption that
the speaker of a language will "spew" his vocabulary in an order which is
isomorphic with the probabilities of those vocabulary items in the language
he speaks. In order to select those items which have the greatest utility
over as wide a linguistic context as possible, we conceptually abstract
the notion of high information over stimulus environments within a form
class frame. This is the equivalent of the abstraction which Carroll's
SFI makes with respect to content sources. It differs in that in our
research we have defined informational uncertainty in terms of differences
across the speakers of a defined subject population rather than across the
authors of differing content texts.

SECTION IV - BIBLIOGRAPHY



1. Aborn, Murray, and Rubenstein, Hubert. Word class distribution in
sentences of fixed length. Language, 1956, 32, 666-674.

"Samples of printed English sentences of three lengths were drawn
randomly from a representative selection of popular magazines.
The words of each sentence were classified according to Fries's
system and a count was made of the various word classes at each
sentence position. The tabulations were plotted as frequency
classes and treated statistically. Principally, the data obtained
from all three sentence lengths indicate (1) that the greatest
variations in word-class frequency tend to occur in sentence
extremes and the immediately adjoining positions, and (2) that
different word classes have characteristic patterns of variation."

2. Aborn, Murray, and Rubenstein, Hubert. Perception of contextually
dependent word-probabilities. American Journal of Psychology, 1958,
71, 420-422.

"Eighty subjects were instructed to write down the first eight
words coming to mind which could replace the missing word in
a sentence, and then to rank these eight words in order of
decreasing likelihood of occurrence in the sentence. The
findings both for long and for short sentences may be summarized
as follows: (1) words perceived as being more probable in a
given context tended to be those actually occurring with greater
probability in that context; and (2) greater agreements among
subjects' responses were exhibited in the case of words perceived
as more probable than in the case of words perceived as less
probable. Together with the work of Zipf, these results suggest
the following generalization: In contexts of low constraint,
the number of different probabilities perceived is far less
than the number of possible alternatives."

3. Allen, J. The Swahili and Arabic manuscripts and tapes. Leiden, The
Netherlands: E. J. Brill, 1970.

This is a compilation of unpublished literature in Swahili. It
states that the earliest manuscript found in Swahili is dated 1728.
The scope is largely Swahili written in Arabic script. Part 1
is a serial list of holdings with descriptions: first those in
Swahili and second those in Arabic. Part 2 is an alphabetical
list of Swahili manuscripts by titles and first lines. Tapes
include verse and prose examples.






4. Allen, W. Living English structure (practice book for foreigners).
London: Longmans, Green, and Company, Ltd., 1949.

This book is an empirical approach using a series of 15 exercises
which drill English structure into the student. The exercises
are graded according to difficulty as elementary, intermediate,
and advanced.

5. American Mathematical Society. Structure of language and its
mathematical aspects. Proceedings of Symposia in Applied Mathematics,
Providence, Rhode Island: American Mathematical Society, 1961 (2d
Printing, 1964), 12.

This is a compilation of 20 studies on the subject of linguistics,
logic, and mathematics by well-known experts in these fields. Of
particular interest is the chapter "On the Theory of Word Frequencies
and on Related Markovian Models of Discourse" by Benoit Mandelbrot
(pages 190-219). Mandelbrot's chapter treats a variety of topics
related to the models for the law of word frequencies by Estoup and
Zipf. It discusses diachronic and synchronic aspects of the model.
It also contains a criticism of certain attempts to apply the
lognormal probability distribution to data on word frequencies.
The final part is a discussion of linguistics and the role of
statistical and other enumerational laws, such as the Willis
species-genera relationship.

6. Anshen, R. Language: an enquiry into its meaning and function. New
York: Harper and Brothers Publishers, 1957.

This is part of the Science of Culture Series. There are 19
chapters or articles by different authors, except that the first
and 19th chapters are by the compiler. These two, "Language as
Idea" and "Language as Communication", together with Chapter 9,
Roman Jakobson on the Cardinal Dichotomy of Language, have
relevance for statistical analyses of vocabulary.

7. Ashton, E. Swahili grammar (including intonation). London: Longmans,
Green and Company, Ltd., 1966 (13th Printing).

This text is divided into two parts: Part 1, progressive lessons
with exercises, is concerned with everyday conversational phrases
and literature; Part 2 goes into more detail on each grammatical
topic but has fewer exercises. There is a key to the exercises
after the last chapter. The last chapter concisely summarizes the
highlights of the book. There is a vocabulary of nouns and verbs
used in the lessons and exercises at the end of the book, followed
by additional situational exercises, e.g., "at the office" and
"at the hospital".

8. Bailey, D. Glossary of Japanese neologisms. Tucson: University of
Arizona Press, 1962.

The purpose of this glossary was to collect in one place a list
of useful new words and phrases not found in Japanese-English
dictionaries, specifically Kenkyusha's New Japanese-English
Dictionary of 1954. It includes proper nouns of considerable use,
other useful words overlooked in the referenced dictionary, and
a list of Japanese words in current use not found in Kenkyusha's
New Japanese-English Dictionary. Some 12,000 candidate words were
narrowed down to some 6,000. Sources of words are: "Basic
Information on Current Words 1959-62", "Dictionary of Newspaper
Terms 1960", "Handbook on Words in Current Use 1961", and
"Dictionary of National Language 1960".

9. Bailey, Richard W., and Burton, Dolores M., S.N.D. English stylistics:
a bibliography. Cambridge, Mass., and London: 1968.

A collection of over 2,000 items concerning general stylistics
and style in English and American literature since 1500, the
work is divided into three main sections: bibliographical sources,
language and style before 1900 (including works on stylistics in
antiquity), and English stylistics in the Twentieth Century.

10. Bakaya, R. M. An experiment in compiling a minimal vocabulary for
reading scientific-technical literature in Russian. Babel, 13, 1967,
163-168.

The purpose is described as providing a minimal reading or receptive
vocabulary for scientific-technical literature in Russian.
The method was to compile the vocabulary from a comparative study
of nine existing word lists, in general selecting those words
which occur on at least three of the lists. The result was a minimal
vocabulary of some 3500 words, which was then checked against
three scientific texts; the Bakaya list covered more than 95 percent
of the texts directly or indirectly.

11. Baker, Sidney J. The pattern of language. The Journal of General
Psychology, 1950, 42, 25-66.

After an extensive summary of previous investigations of lexical
data both within and across linguistic boundaries, the author
reports his calculations of word length (by letters) in several
word lists. He concludes with a discussion of Zipf's law and
polysemy.

12. Baker, Sidney J. Ontogenetic evidence of a correlation between the
form and frequency of use of words. The Journal of General Psychology,
1951, 235-251.

Baker examined a 40,000 word collection of letters written by a
paranoid schizophrenic and compared the rank-frequency distribution
of words in the letters with similar lists published by
Horn and Thorndike.

13. Bar Hillel, Y. Logical syntax and semantics. Language, 1954, 30.
(Bobbs-Merrill Reprint L-3.)

A good part of this article is devoted to refuting Zellig Harris'
(Methods in Structural Linguistics) contention that most
considerations of meaning in linguistics can be satisfied by
distributional procedures. Bar Hillel notes that most structural
linguists have recognized that not all aspects of linguistics
can be handled by distributional analysis alone, in spite of
Harris' thesis that he can explain synonymy and active-passive
relationships. Bar Hillel attacks what he believes is Harris'
basic assumption that "any two morphemes having different meaning
will also differ somewhere in distribution." He says that, by
extension of this statement, many of the transformational
aspects of language, if not all of them, would be reducible to
the formational aspects. However, Bar Hillel says this is not
true. He then proceeds to elaborate. If he dislikes Harris'
thesis, he does agree with Rudolf Carnap, who believes that
logical analysis has an equal place with distributional analysis
and that modern semantics must also be considered.



14. Barber, C.L. Some measurable characteristics of modern scientific
prose. Contributions to English syntax and philology, ed. Frank Behre,
Gothenburg: Adler, 1962, 21-43.

The author is primarily concerned with identifying features of
scientific language that will constitute particular difficulties for
non-native speakers of English. He gives particular attention to
clause and verb phrase structure and to the identification of words
that appear frequently in scientific writing but not in the general
vocabulary of English.

15. Barker, Muhammad Abd-al-Rahman. An Urdu newspaper word count. McGill
University Institute of Islamic Studies, 1969.

"This volume is the last of four works dealing with the Urdu
language prepared by the Institute of Islamic Studies, McGill
University. The present volume, although not intended primarily
as a dictionary, is suggested as a supplementary vocabulary
source for further reading and research. The corpus upon which
this work is based contains 136,783 running words, collected
from 15 Pakistani newspapers. The author's rules (which differ
somewhat from those of Brill and Landau), as well as a discussion
of word counts, the corpus of this work, word count methodology,
Arabic orthography, and other pertinent information, are presented
in the introductory section. Part One comprises the Urdu-English
Alphabetical List, which gives the orthography, frequency,
pronunciation, grammatical class membership, meaning, and usage of
each lexeme. Part Two, the Frequency List, relists all occurring
words in descending order of frequency."

16. Barth, Gilbert. Recherches sur la fréquence et la valeur des parties
du discours en français, en anglais, et en espagnol. Paris: 1961.

A statistical study of the degree to which the three languages
exploit the possible combinations of word classes.

17. Becker, Selwyn D., Bavelas, Alex, and Braden, Marcia. An index to
measure contingency of English sentences. Language and Speech, 1961,
4, 139-145.

"Several indexes to measure contingency of sentences were
constructed by considering nouns, repeated nouns, and total number
of words. Contingency was operationally defined as
reconstructibility in order to test the several indexes against a
criterion. The best form of the index was then selected and retested.
The contingency ranking, based on the index, of ten sections of text
correlated 0.8^ with the reconstructibility ranking. It was
concluded that the index is a valid initial approximation to
a measure of contingency if contingency is defined as
reconstructibility."

18. Becker, Selwyn, and Carroll, Jean. The effects of high and low
sentence contingency on learning and attitudes. Language and Speech,
1963, 6, 46-56.

"By a logical analysis it was shown that the sentence contingency
is roughly equivalent to Shannon's measure of redundancy. In two
independent experiments it was demonstrated that a significantly
greater number of multiple choice questions are answered by those
who study text characterized by higher sentence contingency, or
redundancy. The findings were compared to those found in
investigations of the effects of redundancy on words and syllables.
Data from a third experiment provided support for the conclusion that
preference for text material is also related to sentence
contingency."

19. Beier, E. G., Starkweather, J. A., and Miller, D. Analysis of word
frequencies in spoken language of children. Language and Speech, 1967,
10, 217-227.

The purpose of the study was to establish certain base rates in
the language usage of children and to investigate some of the
psychological significances of those base rates. The authors
wanted to know whether their data would support Zipf's Laws, in
particular whether, in a given language sample, the number of
different words would increase as the frequency of occurrence
becomes smaller, and whether the magnitude of the words would
tend to stand in inverse ratio to the number of occurrences of
a given word. Additionally, the authors sought to determine
which, if any, of a number of variables (such as the type/token
ratio, word lists, magnitude of words, and the 10 most frequently
used words) would differentiate the age groups. The experiment
took place in Salt Lake City with grade school children. It is
not clear what stimulus materials were used, but the boys were
told not to use prepared speeches and were taught how to handle tape
recorders. Each boy recorded about 5000 words from which about
2700 were selected and compiled into two 40,000 word corpora
for a grand total of 80,000 words for both groups. Five one-minute
samples of each boy's speech were recorded in order to
obtain the rate of speaking in words-per-minute for each. The
results were processed by an IBM 7094 computer and the two
lists were compared by frequency between themselves and with
the Eldredge Newspaper count of 1911. Each count had 42,000-
43,000 words. The two children's counts each had about 3100
different words and the Eldredge count about twice that many
(6000). The results tend to indicate that the printed language
has a greater variety of expression than oral, which others have
suspected. However, in this case, since printed adult newswriting
was compared to two grades of elementary school oral
output, some of the difference in variety is undoubtedly caused
by the greater age and experience of the newswriters. In fact,
differences between 6th and 10th graders tend to show somewhat
greater variety among the 10th graders. The 1 percent level of
significance was used, and the data indicate that older boys speak
faster than younger ones, use significantly more positive and negative
words, use slightly more singular self-reference, use slightly
less plural self-reference, use more "other" references, and
use slightly more "question" words. At equivalent intelligence
levels, the two age groups showed insignificant differences in
type/token ratios. Eight of the ten most frequently used words in
both boys' lists were the same, although not in the same order. Among
the words that differed, the 6th graders used "it" and "we", and the
10th graders used "not" and "do". In both groups, in general, shorter
words tended to be used more often than longer ones. The authors
hope to use their findings in developing a psycholinguistic
profile of individuals for assessment of development, for
developing reading materials better suited to age groups, for
better understanding sequences in language development, and in
inter-cultural comparison. A caveat is that with their small
sampling (40,000 words) in one city (Salt Lake - 1965), far
removed in time and space from the Eldredge sample (Buffalo -
1911), of only one sex (male) and of a different age group from the
Eldredge count (grammar school vs. adult), one must be careful
about drawing sweeping conclusions.

20. Belevitch, Vitold. On the statistical laws of linguistic distributions.
Annales de la Société Scientifique de Bruxelles, 1959, 73, 310-326.

"The rank-frequency diagrams of statistical linguistics are
reinterpreted as distribution curves of the cumulative probability
of types in the catalog versus the probability of tokens in the
text. For such distributions, the closure condition Σ p_i = 1 (which
does not hold in general statistics for the independent variable)
imposes certain relations between the mean, the variance, the
number of elements in the catalogue and the average information
content (negative entropy). Sections 2 to 4 are devoted to the
mathematics of these relations, especially to their particular
forms for truncated normal distributions. First and second
order Taylor approximations to an arbitrary distribution law take
the form of Zipf's and Mandelbrot's laws, respectively.
Experimental data approximate a truncated normal distribution with
σ = 2.8 bits as the general law for words. Data on letter and
phoneme distributions seem to indicate that the standard deviation
has a universal value of σ = 1.4 bits."

21. Belonogov, G. G. Raspredelenie castot pojavlenija flektivnyx klassov
russkix slov. (Frequency distribution of the inflected word classes)
(In Russian.) Problemy Kibernetiki, 1964, 4, 189-198.

Statistical data concerning the distribution of inflected word
classes in Russian were obtained by a computerized count from
some half a million words of text.

22. Berckel, J. A. Th. M. van, Corstius, H. Brandt, Mokken, R. J., and
Wijngaarden, A. van. Formal properties of newspaper Dutch. Amsterdam:
1965.

Some 50,000 words were obtained from the issues of ten Dutch
newspapers that appeared on June 19, 1959. In addition to examining
the differences between the newspapers, the authors provide
statistics for letter combinations, syllables, the rank-frequency
relation of words, word length and type-token distribution of
words.

23. Berger, K. W. An evaluation of the Thorndike and Lorge word count.
Central States Speech Journal, 1971, 22 (1), 61-64.

"The publication by Thorndike and Lorge on the frequency of word
appearance in English is often quoted as being representative of
English speech. To examine possible differences in the word count
by Thorndike and Lorge with contemporary printed materials a
comparison was made between that work and a sample of 10,000 words
taken from the November 20, 1969 issue of the "New York Times."
The findings suggest a substantial but not dramatic difference
between the two counts. Word comparisons from other contemporary
printed sources would be useful, but researchers could concentrate
their energy toward open word classes."

24. Berger, K. Conversational English of university students. Speech
Monographs, 1967, 34 (1), 65-73.

"A study examining sentence length, phonetic content, word length,
grammatical content, and word usage in student spontaneous speech.
Sentences were collected and transcribed in informal settings. The
average sentence was found to be 7.8 words with 23.5 phonemes. The
most and least common phonemes are noted. The words "I" and "You"
accounted for 7.2% of all words collected; 12 words comprised 25%
of the words used; 50 words accounted for of the conversations.
Verbs appeared more frequently than any other part of speech followed
by pronouns and nouns. Agreement in phonetic content and word
frequency was found between these data and those of previous studies
leading to the conclusion that these 2 parameters are reasonably
stable in usage from late childhood through adulthood."

25. Berger, K. The most common words used in conversations. Journal of
Communication Disorders, 1968, 1 (3), 201-214.

"Unguarded informal conversational vocabulary from a general adult
population was sampled in the northeastern Ohio area. The sample
produced 25,000 words of which there were 2,307 different words.
A limited vocabulary usage and simple words were found, as compared
with more formal speech and with printed English. The words found in
the present study are presented in an appendix. The appendix gives all
of the words found, in alphabetical order, and includes variants
of the base word where syllable length does not change. The
usefulness and application of oral vocabulary as opposed to written
vocabulary are briefly discussed. Further samplings of
conversational speech, in spite of the difficulty as contrasted to
printed materials, are recommended, particularly to determine
consistency and variability based on geographical areas."

26. Berry, Jack. Some statistical aspects of conversational speech.
Communication theory, ed. Willis Jackson, New York and London: 1953,
392-401.

The article reports on an investigation of stress patterns in a
24,781 word sample of conversational speech; the incidence of
stress in high frequency function words is given particular
attention.

27. Berry, Jack. Oral data collecting and linguistics in Africa. Folklore
Institute Journal, 6, 1969, 93-110.

Discusses problems of selecting informants, eliciting and recording
oral data. The article contains an appendix by Earl Stevick on the
making and use of field tapes both for raw material and as a basis
for pedagogical treatment.



28. Black, John W., and Ausherman, Marian R. The vocabulary of college
students in classroom speeches. Columbus, Ohio: The Ohio State
University Bureau of Educational Research, 1955.

This study extends a prior study by Ausherman in 1950 entitled
"Formal Spoken Vocabulary of College Students" and work done for
the Office of Naval Research (ONR) by Kenyon College and Ohio
University's Research Foundation. The informants were 27^ male
college students with samples taken from 607 classroom speeches.
The objective was to obtain oral colloquial vocabulary in
extemporaneous speech situations. The students were not typical
college students, however. They were military enlisted personnel
largely from the Midwest who were taking a background course in
preparation for specializing in meteorology while in the service.
They were highly intelligent, had high scholastic credentials,
and high aptitude in mathematics. The samples were in general
3 1/2-4 minutes of speeches lasting five minutes on the average.
The students used a microphone but thought it was a public address
device since the recorder was in another room. Speeches had
generally been outlined, but had not been written out or rehearsed.
Recordings were transcribed for statistical analysis. Procedures
for enumerating inflections followed Thorndike's procedures as far
as possible. The corpus amounted to 288,152 running words including
6,826 different words. Frequencies ranged from 15,000 for "the"
to nearly 2,000 words which occurred only once. Comparison of the
statistics (oral-1955) with Thorndike's Teachers' Word Book of
20,000 Words (Printed) (either 1931 or 1944) was ambiguous. All
of Thorndike's categories were represented in approximately the
same relationship to each other as in his list, but were distributed
differently in oral statistics. For example, Thorndike's first 1000
words accounted for only 14 percent of the first 1000 oral words
(in order of frequency). In addition, 662 or nearly 10 percent of
the oral vocabulary could not be found in Thorndike's 20,000 words.
This is partly accounted for by the fact that many of the 662 words
were neologisms, slang, occupational jargon, and colloquial
compounds which employed non- and un- prefixes to form antonyms.
Although groups of words in the speeches roughly corresponded to the
same groups in written counts, there were many words in written lists
that were not in the oral. Interestingly, the nine words which Dewey
found to make up 25 percent of English words used make up 22 percent
of the oral list. Further, all of the first 50 most common words in
the oral list were found in Dewey's first 83 words, and all but three
(no, my, and me) of Dewey's first 50 words were found in the first
100 words. The authors note that two factors favorably affect the
ability of a listener to understand oral language: familiarity
(related to frequency) and number of syllables, with the former more
important than the latter. They also note that oral vocabulary tends
to be more restricted than written. The data from the study are
presented in two lists: (1) A listing of words in descending order
of frequency with breaks and summaries at selected frequency limits,
e.g., 1000 and above, 100-999 and 50-99. (2) An alphabetical listing
of words keyed to the related frequency groups in which the same
words will be found in List 1.

29. Blankenship, Jane. A linguistic analysis of oral and written style.
Quarterly Journal of Speech, 1962, 48, 419-422.

This study of four samples each of writing and formal speeches,
analyzed according to the method of C.C. Fries, examines the
percentage of occurrence of each word class by position in the
sentence and the subcategories of the verb. The author concludes
that syntactic structure is more indicative of individual style
than of the mode of discourse.

30. Bloch, B., and Jorden, E. Guide's manual for spoken Japanese, basic
course, units 1-30. New York: Henry Holt and Company,

This book is almost entirely in Japanese. Section A includes
basic sentences, pronunciation practice, practice in basic
sentences, notes, exercises, check-up exams, and review of
basics. Section B is the same as A for different basic sentences.
Section C covers final check-up, listening in and free
conversation. (Also published for the Armed Forces.)

31. Bongers, H. The history and principles of vocabulary control. Woerden,
Holland: Wocopi, 1947.

The book was written in the context of teaching foreign languages
in general and English in particular. While recognizing the
problems of syntax or word usage for the person learning a language,
this book concentrates only on vocabulary. The book consists of
three parts: Part 1 is a general treatment of vocabulary including
definitions and history. It also includes Palmer's (Belgian)
contributions, graded texts, quantitative statistics, classroom
vocabularies, Basic English (negative conclusion), world language and
a comparison of several word lists: Thorndike, 20,000; Faucett and
Maki, 6,000; Palmer, 3,000; Palmer, Faucett and West, 2,000; Palmer
and Hornby, 1,000; and Eaton, 739. From a study of the above, the
author derives a new 3,000 item word list (the KLM List). Part
2 is a critical review of various word lists and includes thirteen
appendices and a bibliography. Part 3 is a tabulation of the
author's KLM List.

32. Booth, Andrew D. A 'law' of occurrences for words of low frequency.
Information and Control, 1967, 10, 386-393.

"The way in which the number of words occurring once, twice, three
times, and so on in a text is related to the vocabulary of the author
has been investigated. It is shown that a simple relationship holds
under more general conditions than those implied by Zipf's law."

33. Borko, Harold (ed.) Automated language processing; the state of the
art. New York: Wiley, 1967.

This is a collection of eleven original essays divided into three
parts: "Language Data Processing," "Statistical Analysis," and
"Syntactic Analysis."

34. Bourne, Charles P., and Ford, Donald F. A study of the statistics of
letters in English words. Information and Control, 1961, 4, 48-67.

"Data which had previously been published by several authors
to describe the statistical characteristics of English words
were examined to show the extent of their agreement. In addition,
a detailed empirical study was made of two special types of
English words: subject words and proper names. The statistical
parameters which were measured and compared are: the distribution
of letters, the distribution of terminal letters, the composite
or total distribution of letters, the distribution of characters
for each letter position, the distribution of bigrams, and the
distribution of word length."

35. Bowep, John H. Frequency stability of adjective trait names.
Psychological Reports, 1972, 30, kJl'W-

"Using frequencies from the earlier Thorndike-Lorge and the
later Kucera-Francis frequency counts, a lognormal distribution
model is applied to judge shifts in the frequencies of occurrence
of trait adjectives from a likeableness scale. In the time
between frequency counts, the frequencies of the adjectives
shifted an average of approximately .68 words per million toward
higher frequencies of occurrence. The amount of shift
would probably not vitiate the generalizability of results
based upon the Thorndike-Lorge count."

36. Brain, J.L. Basic structure of Swahili. Syracuse, New York: Syracuse
University Program of East African Studies, 1968.

This was an interim grammar of Swahili until a full reference
grammar could be produced. It was written in East Africa as
a teacher's guide and students' reference for an oral Swahili
course. It is designed for the quicker coverage (two semesters)
of the five semesters of the Foreign Service Institute Course.
The lessons take up various aspects of basic grammar. There is
a basic vocabulary and a series of exercises with Swahili and
English translations.

37. Brain, J.L. A short dictionary of social sciences terms for Swahili
speakers. (Program for East African Studies, Occasional Paper ^51)
Syracuse, New York: Syracuse University, September, 1969.

The purpose of the dictionary is to provide Swahili speakers a 
vocabulary in the social sciences in the form of a dictionary. 
Terms were selected from UNESCO's "A Dictionary of the Social 
Sciences" by Gould and Kolb (ed.). 

38. Brain, J.L. Basic structure of Swahili, Part II (a background to the
Swahili language and advanced exercises). (Syracuse University Program
for East African Studies) Syracuse, New York: Syracuse University,
August, 1969.

This booklet contains a brief background of Swahili in pages 1 to
19. The exercises (pages 21 to 34) provide practice in useful
sentences and also provide the vocabulary to understand them.

39. Brain, J.L. A social science vocabulary of Swahili. (Program for
East African Studies, Occasional Paper ^33) Syracuse, New York: Syracuse
University,

The vocabulary is the beginning of the dictionary for personnel
studying Swahili and Swahili areas. It is based on newspapers
and political manifestos. It is arranged in Swahili-English
ordering.

40. Buchanan, A., and MacPhee, E. An annotated bibliography of modern
language methodology. (American and Canadian Committees on Modern
Languages) Toronto, Canada: University of Toronto Press, 1928, 8.

This bibliography is arranged according to subject matter, such
as: references, histories, aims and methods, learning processes,
tests and examinations, texts used abroad, and miscellaneous.
It is obviously dated.

41. Buchanan, M. A graded Spanish word book. (American and Canadian
Committees on Modern Languages) Toronto, Canada: University of Toronto
Press, 1927, 3.

In his introduction, Buchanan refers to earlier frequency studies
in other languages: Kaeding in German, 1898, Thorndike in English,
1921, and Henmon in French, 1924. The purpose of preparing this
frequency word list was to provide material for graded vocabulary
tests, but it has become a standard, consistently used and referred
to by later compilers of word lists in Spanish and other languages.
The author took samples of 30,000 words each from 40 categories of
printed material which were grouped under seven subject headings to
obtain a total corpus of 1,200,000 running words. Subject headings
included: plays, novels, verse, folklore, miscellaneous press,
technical literature, and periodicals. Buchanan made the assumption
(since recording devices were not well developed at the time)
that an oral word count would not differ materially from his
written one. Buchanan did recognize that what he developed was
an "essential" word list which would have to be augmented in
technical and specialized areas. To give weight to words which
appear in many or most of the 40 categories, the number of
categories was divided into the frequency and the quotient multiplied
by 100 to give a credit number. The types were found to be 18,331
out of the 1,200,000 running words. 5,324 words had a frequency
of 10 or more. Buchanan eliminated 189 words from his count as
being too common; they do appear alphabetically in Part 1 of his
list, however. Part I lists the total word count in order of
frequency. These words appear 10 or more times (frequency of 10 or
more), or they must appear in at least 5 of the 40 categories. Part
3 provides an alphabetical listing of the words giving their
frequency, range, and merit number.

42. Buettner, C. Basic instruction in the Swahili language (self-instruction)
(Huelfsbuechlein fuer den ersten Unterricht in der Suahili-Sprache).
Leipzig: T.O. Weigel Nachfolger, 1891.

This book is in German. It is a booklet of grammar and exercises
for the German speaker (reader) to use in learning basic Swahili.
It updates some of Bishop Steere's work but it is obviously not
current.



43. Bull, William E. Natural frequency and word counts. The Classical Journal,
1948, 44, 469-484.

The subtitle of this article truly represents its content: "The
Fallacy of Frequencies". It is an extremely interesting study which
helps explain some of the devices to which word-counters have had to
resort in compiling their lists (e.g., listing the 50-150 most common
words separately at the beginning of the count before assigning
frequencies; addition of utility or available words (the concrete
nouns) which carry situational meanings but, because situational or
specific, have very limited frequencies in any limited sampling of
printed, written or spoken language), and the problems of tapping the
semantic or content-bearing words without which the lexical units
convey no, or erroneous, communications. There appears to be an
inverse relationship between natural frequency of parts of speech
(i.e., the total number of individual words of a type such as noun or
verb), at least in Indo-European languages, and the frequency with
which these words are used. That is, the greater the number of
individual content-bearing words relating to specific items or
situations, the less frequently they will each be used, whereas the
lesser number of linguistically useful words, such as conjunctions,
articles, prepositions, and relating verbs which tie the
content-bearing words together, are used over and over again
regardless of the situation and thus generate a high statistical
frequency out of proportion to their utility in learning a language.
The author's summary is illuminating as to his points: (1) any word
count is a statistically valid report only on what is included within
it, (2) extremely high-frequency words are rarely the content-bearing
elements of any communication, (3) range and frequency of words are
determined by two sets of forces: linguistic and cultural, (4) it
cannot be assumed that there is a correlation between frequency and
utility, (5) word counts based on the hypothetical existence of the
(any) language as a static entity cannot be considered a valid
representation of a people's cultural and linguistic activities and
hence are of dubious value from a pedagogical point of view. The
author's final indictment comes in his last paragraph: "From the
foregoing evidence it would seem proper to draw the conclusion that
there are so many factors and so many uncontrollable elements in life
and language that no satisfactory results can be obtained by
attempting to reduce natural heterogeneity to an artificial
homogeneity by statistical methods. It may be concluded, although it
is done so with considerable reluctance by the writer, that word
counts cannot be considered a valid representation of a people's
cultural and linguistic activities and that as a result their
pedagogical usefulness is extremely dubious."

44. Bull, William E. Natural frequency and word counts. Classical Journal,
1949, 44, 469-484.

"1. Any word count is a statistically valid report only on what
is included in it. 2. Extremely high-frequency words are rarely
the content-bearing elements of any communication. 3. Range and
frequency of words are determined by two sets of forces: cultural
and linguistic. 4. It cannot be assumed that there is a
correlation between frequency and utility. 5. Word counts based on
the hypothetical existence of the (any) language as a static
entity cannot be considered a valid representation of a people's
cultural and linguistic activities and hence are of dubious value
from a pedagogical point of view."

45. Burton, N.G., and Licklider, J.C.R. Long-range constraints in the
structure of printed English. American Journal of Psychology, 1955,
68, 650-653.

"An experiment modeled after Shannon's was conducted to determine
the extent to which estimates of the letter redundancy of English
texts are dependent upon the number of preceding letters known
to the subject. Data obtained indicate that, while the estimate
of relative redundancy increases as knowledge of the foregoing
text is extended from zero to approximately 32 letters, increasing
the known number of letters beyond 32 does not result in any
noticeable rise."

46. Bushnell, Paul F. An analytical contrast of oral and written English.
New York: Teachers' College, 1930.

Various aspects of the language of student compositions are
correlated with expert judgments of their merits; sentence- and
word-level features are measured and "errors" of various kinds
are tabulated.

47. Card, William, and McDavid, Virginia. English words of very high
frequency. College English, 1966, 27, 596-604.

The authors examine a variety of word frequency counts and discuss
the biases inherent in them.

48. Carroll, John B. Diversity of vocabulary and the harmonic series law
of word-frequency distribution. The Psychological Record, 1938, 2,
379-386.

The author is interested in a diversity equation whereby the
relation of the number of different words in a vocabulary can be
estimated despite variations in sample size. Illustrative
material is provided from Santayana's The Last Puritan, Hanley's
word count of Joyce's Ulysses, and the word-frequency lists
compiled by Eldridge and Dewey.

49. Carroll, John B. How often a word? Review of John W. Black and Marian
Auscherman's The vocabulary of college students in classroom speeches.
Contemporary Psychology. Columbus, Ohio: Bureau of Educational Research,
Ohio State University, 1956, 1, (7), 220.

Carroll calls the Black and Auscherman count the most extensive
oral one yet and welcomes it, since he believes the Thorndike
word-count was not really representative because of its heavy
emphasis on the Bible and older literary forms as opposed to
contemporary sources. Carroll also believes that the new count
will be helpful in controlling the word frequency factor in
future experiments. In parting, Carroll questions rhetorically
(and without answer) whether a spoken vocabulary is different
from a written one.

50. Carroll, John B. The contributions of psychological theory and
educational research to the teaching of foreign language. Modern
Language Journal, 1965, 49, 273-281.

"This address, given at the International Conference on Modern
Foreign Language Teaching (Berlin, September 1964), presents a
general discussion of the present scope, role, and potential uses
of research in foreign language teaching methodology, and
maintains that the best research is that which is closely allied with
theory, and the hardest to apply is that which has direct bearing
on classroom procedure. It points out the great scope for
development in the theory of foreign language learning, citing favorably
the work of mathematical learning theorists who have devised exact
equations for the rate at which material is learned or forgotten.
The need of forming an accurate theoretical comparison between the
'audiolingual habit' and 'cognitive code-learning' theories is
discussed, such experiments being difficult to control since it
is almost impossible to predict the exact techniques a student
will employ and since the theoretical contrast has not been
sufficiently well conceived. Neither method is based on modern
theories of the psychology of language learning, and the
discussion concludes with a critical comparison of the two, recommending
a joining of audiolingual technique with some of the better elements
of cognitive code-learning theory."

51. Carroll, John B. Review of G. Herdan's The advanced theory of language
as choice and chance. American Scientist, New York: Springer-Verlag,
1966, 54, 480A-481A.

Carroll does not like this book. He says it is mainly a reprint,
with some exceptions, of parts of Herdan's earlier works,
including his earlier book with the simpler title of Language as
Choice and Chance. He does not see how the material can be
called "advanced"; he finds it at best elementary and in some
cases a retrogression from Herdan's earlier books. He
concludes by saying that in spite of some provocative material,
Herdan has revealed himself as behind the times in linguistics
and cannot pass as a mathematician's linguist or a linguist's
mathematician.

52. Carroll, John B. On sampling from a lognormal model of word-frequency
distribution. Computational analysis of present-day American English,
Providence, R.I.: 1967, 406-413.

"In our investigations thus far we have not yet arrived at an
efficient method for estimating the parameters of the theoretical
population from the characteristics of a sample" (page 413).
The attempt to determine such a model from the empirical data
in the Brown Corpus is discussed.

53. Carroll, John B. An alternative to Juilland's usage coefficient for
lexical frequencies, and a proposal for a standard frequency index
(SFI). Computer Studies in the Humanities and Verbal Behavior, 1970,
3, 61-65.

"A new word usage coefficient, $U_m$, is proposed. It is an
adaptation of Juilland's $U$ but in contrast to $U$ it (1) can be computed
from frequencies in unequally-sized categories, (2) uses a more
appropriate measure of the dispersion of word probabilities over
categories, (3) will not take the value zero when all occurrences
are concentrated in a single category, and (4) is always scaled
in terms of a corpus of a standard million tokens. Computations
are given for illustrative data and discussed. For many purposes,
however, a logarithmic frequency scale is more convenient and
meaningful, and it is thus proposed that frequency data be scaled
according to the formula $\mathrm{SFI} = 10(\log_{10} p + 10)$, where SFI is the
Standard Frequency Index and $p$ is the probability or proportional
frequency of the item. An equivalent formula based on $U_m$ is
$\mathrm{SFI} = 10(\log_{10} U_m + 4)$. For most data from standard frequency counts,
values of SFI will range from 35 to 90, each unit increment
corresponding to an increase of about 25.9 percent in frequency."
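
The two SFI formulas quoted in the abstract can be checked numerically. The following is a minimal Python sketch, not Carroll's own computation: the function names are hypothetical, p is assumed to be a proportion between 0 and 1, and the usage coefficient is assumed to be expressed per million tokens.

```python
import math


def sfi_from_probability(p: float) -> float:
    """Standard Frequency Index from a proportional frequency p (0 < p <= 1)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be a proportion in (0, 1]")
    return 10.0 * (math.log10(p) + 10.0)


def sfi_from_per_million(u: float) -> float:
    """The same index from a frequency scaled to one million tokens (assumed u > 0)."""
    if u <= 0.0:
        raise ValueError("u must be positive")
    return 10.0 * (math.log10(u) + 4.0)


def percent_increase_per_unit() -> float:
    """Percent growth in frequency implied by a one-unit rise in SFI."""
    return (10.0 ** 0.1 - 1.0) * 100.0


if __name__ == "__main__":
    # A word occurring once per million tokens: p = 1e-6, u = 1.
    print(sfi_from_probability(1e-6))
    print(sfi_from_per_million(1.0))
    print(round(percent_increase_per_unit(), 1))
```

Under these assumptions a word occurring once per million tokens has an SFI of 40 by either formula, and one SFI unit corresponds to a factor of 10 raised to the power 0.1, which agrees with the figure of about 25.9 percent given in the abstract.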

54. Carroll, John B. Measurement properties of subjective magnitude
estimates of word frequency. Journal of Verbal Learning and Verbal
Behavior, 1971, 10, 722-729.

"Stevens' subjective magnitude estimation (SME) method was used
in obtaining estimates of relative word frequency from two adult
groups (15 lexicographers, 13 other adults) for 60 words ranging
widely in objective frequency. Lexicographers rendered more
reliable estimates, and their averaged data correlated more highly
(.970) with objective log frequency than those of the second group
(.923). The objective frequency of the first stimulus considered
in the SME task is not related to an S's overall accuracy in
predicting objective frequency, but accuracy is related to the S's
tendency to perceive frequency ratios as relatively large.
Subjective estimates measure available objective counts, and may be more
valid measures of true word probability."

55. Carroll, John B. Current issues in psycholinguistics and second
language teaching. Paper presented at the Fifth Annual TESOL Convention,
New Orleans, La., March, 1971, ERIC Accession No. ED-052-643.

"Seemingly conflicting points of view concerning language
instruction which are expressed in various teaching methodologies are
reconciled in this paper. Key issues discussed include: (1) the
nature of linguistic rules and their relation to the 'habits' of
language use, (2) the role of grammatical theory in language
teaching, (3) the nature of language learning, (4) a balance
between an audiolingual habit theory and a cognitive code theory,
and (5) some of the critical variables in language pedagogy. The
author illustrates why the field of language instruction has
become characterized by pedagogical uncertainty and concludes that
the teacher's ability to manage learning behavior remains one of
the most unexplored, unstudied variables in educational research."

56. Carroll, John B. Behind the scenes in the making of a corpus-based
dictionary and word frequency book. Paper presented at the meeting
of the National Council of Teachers of English, Las Vegas, Nev.,
November, 1971, ERIC Accession No. ED-056-842.

The publication of the American Heritage Word Frequency Book and
the American Heritage School Dictionary marked a new advance in
the technology of dictionary and word-frequency book construction.
The use of high speed computers enabled the compilers to analyze
five million words from a body of materials frequently used in
elementary and junior high schools. New mathematical techniques
have improved the accuracy and scope of word frequency analysis.
The word frequencies are listed by grades, thus enabling teachers
and writers to get accurate information on the specific level they
are interested in. References are included.



57. Carroll, John B., Davies, P., and Richman, B. Word frequency book.
New York: Houghton-Mifflin Company and American Heritage Publishing
Company, Inc., 1971.

This is the most recent of vocabulary counts, and it is an
excellent one, although it has limited application to adult
spoken language since its samples were drawn from printed
English to which children in grades 3 to 9 are exposed, with
emphasis on grades 4 to 8. In addition, it concentrates on words
rather than meanings so that the semantic part of word counts is
not covered. The publication begins with a foreword and notes
on the development of the corpus by Richman, notes on the
Statistical Analysis by Carroll, and New Views on a Lexicon by Davies.
The corpus was a computer-assembled selection of 5,088,721 words
(tokens) drawn in 500 word samples from 1,045 published materials
(texts and other student-used materials). It contains 86,741
different words (types). The materials from which the 500 word
samples were drawn were textbooks, workbooks, student kits, novels,
poetry, general non-fiction, encyclopedias, and magazines, as of
November and December 1969. The samples reflect 22 subject areas,
17 of which were curriculum areas, three library categories, a
magazine category, and a miscellaneous category which eventually
turned out to be devoted principally to religion. The sampling
of 1,045 texts was taken from 6,162 titles submitted in response
to a national survey of U.S. schools, including public, Roman
Catholic, and independent (private) schools. The 1,045 texts
were in about 46 percent of the replies, although they constitute
only about 16 percent of the 6,162 titles submitted. Machine
processing of the data provided two types of output: citations
(occurrences of types extracted in sufficient context to provide
for the construction of definitions later forming the basis for
the American Heritage School Dictionary) and descriptive statistics
(frequency of occurrence and distribution). The statistical work is
based on the lognormal model developed by Herdan. The results are
displayed in alphabetical lists with frequencies indicated,
classified by grade and by subjects. They are also displayed in
frequency rank lists and frequency grouped distribution lists by total
corpus, by grade, and by subject.

58. Carroll, John B., and Lamendella, John T. Subjective Estimates of
Consonant Phoneme Frequencies. Educational Testing Service Research
Bulletin RB-72-11, Princeton, New Jersey, 1972.

"Subjective magnitude estimates of the frequencies of 24 consonant
phonemes were obtained from 65 university students, some with
training in linguistics, by a method that had been used by Attneave
(1953) for judgements of letter frequencies. Reliabilities of
averaged judgements for comparably sized groups of 30 judges were
estimated as in the neighborhood of .93. Averages of
logarithmically transformed judgements were correlated with log frequencies
from objective counts with coefficients in the range .736 to .876
(or .764 to .907 when corrected for attenuation). Despite the
high reliabilities and predictive validities, there was evidence
that the judgements were strongly influenced by experienced
frequencies of letters of the alphabet."
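
The phrase "corrected for attenuation" refers to adjusting an observed correlation for the unreliability of the measures. The sketch below uses Spearman's classical correction; whether the authors applied exactly this formula is an assumption, and the function name and the reliability values are illustrative only.

```python
import math


def correct_for_attenuation(r_xy: float, rel_x: float, rel_y: float = 1.0) -> float:
    """Spearman's correction: observed r divided by the square root of the
    product of the two reliabilities."""
    if not (0.0 < rel_x <= 1.0 and 0.0 < rel_y <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


if __name__ == "__main__":
    # .736 is the lower observed correlation quoted in the abstract.
    # The reliability of 0.93 for the averaged judgements and the
    # error-free objective counts (rel_y = 1.0) are illustrative assumptions.
    print(round(correct_for_attenuation(0.736, 0.93), 3))
```

With these assumed inputs the corrected value comes out near .76, the same order as the lower corrected figure in the abstract, which suggests that only the unreliability of the judgements was being corrected.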

59. Chaplin, H., Martin, and Nihonmatsu, R. Advanced Japanese
Conversation. U.S. Department of H.E.W., Contract OE-5-14-005.

This book on advanced Japanese is designed as a follow-on to
basic texts such as Jorden and Chaplin's Beginning Japanese
which provides considerable facility in conversation. This book
expands the conversational capability of the student by using
three scenarios with a variety of realistic situations. Tape
recordings were made by professional actors in Tokyo. Each
scenario is backed up by vocabulary, notes and drill sentences.
The sentences give practice in the grammatical points involved.

60. Chomsky, N. Logical syntax and semantics (their linguistic relevance).
Language, January-March, 1955, 31, (Part 1). (Bobbs-Merrill Reprint
L-3.)

In this paper, Chomsky comments on the Bar Hillel paper on the
same subject. Basically, Chomsky takes issue with Bar Hillel's
premises that logical syntax and semantics have disciplines or
sub-disciplines which really furnish solutions to linguistic
problems, especially those of transformation and semantics (as
known at the time, i.e., 1955). Chomsky holds that they do not
provide grounds for determining synonymy and consequent relations;
they only point out that consequence is a relation between
sentences, and synonymy a relation between words. Acknowledging
that semantics is divided into a theory of reference and a theory
of meaning, Chomsky states that Carnap's theory of meaning on
which Bar Hillel bases his arguments is inadequate for linguistics.
As for Bar Hillel's citation of W.V. Quine and Tarski in defense
of "meaning", Chomsky says it is a mistake since their work was
principally on the theory of reference which is of little use to
linguists. Chomsky then takes issue with Carnap on the matter of
models since Carnap believes artificial languages are necessary
to the study of natural languages. Chomsky remains skeptical
that a useful model can be constructed. Chomsky concludes by
stating he believes Bar Hillel misunderstood Harris in his
criticism of him in his article. He then objects to the thesis
that incorporating logical syntax and semantics into linguistic
theory will solve certain of its problems, denying that the theory of
meaning in natural language is in any way clarified by constructing
artificial languages in terms of rules which are called synonymous.
Chomsky says we can solve the problems of synonymy and
transformation in English in one of the following two ways, the latter
being the better: by listing synonymous pairs under the heading
"synonyms in grammar" and transformational pairs under the heading
of "transformations", or by finding operational tests to determine
their relationship and eliminate the need for arbitrary listings.

61. Chomsky, N. Review of 'Verbal Behavior' by B.F. Skinner. Language,
January-March, 1959. (Bobbs-Merrill Reprint Series in the Social Sciences.)

In his book, Dr. Skinner provides a functional analysis of
"Verbal Behavior" in the context of his behaviorist psychology.
In general, Chomsky disputes Skinner's claims, largely on the
basis that Skinner's observations of the behavior of the lower
animals cannot be applied in any really profound way to human
behavior. Chomsky describes Skinner's concepts one by one and
attempts to prove they do not describe verbal behavior if taken
literally, or if taken metaphorically they do not add to current
knowledge.

62. Chomsky, N. and Miller, G. Finitary models of language users. Handbook
of Mathematical Psychology (Chapter 13). New York: John Wiley and
Sons, Inc., 1963, 2, 419-491.

This chapter considers some of the models and measures that have
been proposed to describe talkers and listeners, i.e., the users
rather than the language itself. It rests on the distinction
between a person's knowledge and his actual or potential behavior:
the two are not the same, so a formal characterization
of a language is not at the same time a model of users. The
authors state that in considering models for the actual
performance of human talkers, an important criterion of adequacy and
validity is the extent to which the model's limitations correspond
to actual human limitations. Two finite models are considered:
the stochastic and the algebraic. The chapter concludes with a
section on "Towards a Theory of Complicated Behavior". In
constructing models, only the speaker-listener models were used
instead of one for each. Stochastic theories of communications
assume the array of message elements can be represented by a
probability distribution and that communicative processes
transform the probability distribution according to transitional
probabilities. The section on stochastic models contains a
paragraph on word frequencies. With algebraic models, the purpose
is to construct a model for the language user that incorporates
a generative grammar as a fundamental component. This discussion
concentrates on the listener and his faculties for perception,
but only as a matter of convenience since the authors consider
speaker-listener models as proper. Preliminary evidence points
to the Chomsky idea of "kernel" or basic sentences that play a
central role not only linguistically but psychologically as well,
as the individual decides how to transform them into what he
actually says (utterance) or understanding of what he has heard.
In considering a theory of complicated behavior, the authors
take into consideration among other things from linguistic theory:
information and redundancy, degree of self-embedding, depth of
postponement, structural complexity, and transformational
complexity.

63. Chomsky, N. and Miller, G. Introduction to formal analysis of languages.
Handbook of Mathematical Psychology (Chapter 11), Luce, R., Bush, R., and
Galanter, E. eds. New York: John Wiley and Sons, Inc., 1963, 2.

In this study, Chomsky and Miller state that the fundamental fact
that must be faced in any investigation of language and linguistic
behavior is that a native speaker has the ability to comprehend an
immense number of sentences he has never heard before, and to
produce, as the occasion requires, novel utterances that are
understandable to other native speakers. In Chapter 11, they try to
answer the following questions in elucidation of the statement
above: What is the precise nature of the ability, i.e., the nature of
language itself? How is the ability put to use, i.e., can we
develop a model for users of a natural language? How is the ability
developed in an individual? (Chomsky still rejects Skinner's
characterization that language is a set of verbal responses.)
Chomsky and Miller propose a theory of linguistic structure which
must specify the class of possible sentences, the class of possible
grammars, and the class of possible structural descriptions, and
must provide a uniform and fixed method for assigning one or more
structural descriptions to each sentence generated by an
arbitrarily selected grammar of the specified form. The authors
develop two conceptions of linguistic structure: a constituent
structure grammar and the theory of transformational grammar. This
work can be seen as part of Chomsky's continuing evolution of thought
on linguistics starting with his revolutionary "Syntactic Structures".

64. Chotlos, J. W. A statistical and comparative analysis of individual
written language samples. Psychological Monographs, 75-111.

"The present investigation is concerned with the relation of
certain language variables to (1) the length of sample from
which they are derived and (2) certain psychologically pertinent
factors. In general, the language measures employed are based
on a count of the number of different words (types) and the
relationship of such measures to the total number of words, and
to the factors of I.Q., chronological age, locality (city, town,
rural), and sex..." A thousand samples of about 3,000 words
each were collected from Iowa school-children over a five year
period.
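
The dependence of a types-to-tokens measure on sample length, which this abstract describes, can be illustrated with a short Python sketch. It is not Chotlos's procedure: the function names, the lower-casing of words, and the toy sentence are assumptions made only for the example.

```python
from typing import Iterable, List, Tuple


def type_token_ratio(tokens: Iterable[str]) -> float:
    """Number of different words (types) divided by total words (tokens)."""
    words = [t.lower() for t in tokens]
    if not words:
        raise ValueError("sample is empty")
    return len(set(words)) / len(words)


def ttr_by_sample_length(tokens: List[str], step: int) -> List[Tuple[int, float]]:
    """Type-token ratio for growing prefixes of one sample."""
    if step <= 0:
        raise ValueError("step must be positive")
    return [
        (n, type_token_ratio(tokens[:n]))
        for n in range(step, len(tokens) + 1, step)
    ]


if __name__ == "__main__":
    # Toy sample; a real study would use running text of thousands of words.
    sample = "the cat saw the dog and the dog saw the cat".split()
    for length, ratio in ttr_by_sample_length(sample, 4):
        print(length, round(ratio, 3))
```

Even in this toy sample the ratio falls as the prefix grows, which is why measures of this kind have to be compared at equal sample lengths.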

65. Chretien, D. G. A new statistical approach to the study of language.
Romance Philology, 1963, 16, 290-301.

Review of Herdan's Language as Choice and Chance.

66. Cole, L. The Teachers' Handbook of Technical Vocabulary. Bloomington,
Illinois: Public School Publishing Company, 1940.

This compilation draws on prior studies in various academic
disciplines taught through high school level. The criteria used
for vocabularies were frequency of occurrence, importance
(according to experts), and social usefulness. The lists vary from
400 to 2000 in each of 13 subject areas, arranged in four
groupings: mathematics (arithmetic, algebra, and geometry),
language (English composition, American literature, and foreign
language), social sciences (geography and history), and other
sciences (hygiene, general science, chemistry, physics, and
biology). The book is arranged with word lists broken down by
grade level and includes a comparison with the Thorndike 20,000
word list. The author concludes that since no subject falls
completely within the first 20,000 most common English words,
some attention to vocabulary is required before any of the
subjects can be taught effectively.

67. Condon, E. U. Statistics of vocabulary. Science, 1928, 67, no. 1733,
300.

A discussion of the rank-frequency distribution of words in a
text and proposed means for determining the mathematical law
underlying the distribution. Carroll objects to this proposal
since it makes diversity a function of sample size.
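
A rank-frequency distribution of the kind discussed here can be tabulated and summarized in a few lines of Python. The sketch assumes that the law in question is a straight line on logarithmic axes, which the annotation does not state; the function names and the least-squares fit are illustrative choices, not Condon's method.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


def rank_frequency(tokens: Iterable[str]) -> List[Tuple[int, str, int]]:
    """Rows of (rank, word, frequency), most frequent word first."""
    counts = Counter(t.lower() for t in tokens)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ordered, start=1)]


def log_log_slope(table: List[Tuple[int, str, int]]) -> float:
    """Least-squares slope of log10(frequency) against log10(rank)."""
    if len(table) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log10(rank) for rank, _, _ in table]
    ys = [math.log10(freq) for _, _, freq in table]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Artificial counts in which frequency is exactly 60 / rank.
    tokens = ["a"] * 60 + ["b"] * 30 + ["c"] * 20 + ["d"] * 15 + ["e"] * 12
    table = rank_frequency(tokens)
    print(table)
    print(round(log_log_slope(table), 6))
```

The number of rows in the table is the number of types, and it keeps growing as more text is counted, so any constant fitted this way depends on the size of the sample.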

68. D»jijl, Japanese Idioms (Nipongo no idiom). Tokyo: Sanseido, 1950.

This book distinguishes between idioms and free word combinations,
and between the meanings of the idioms and those of the words which
comprise them.

69. Dale, E., and Razik, T. Bibliography of Vocabulary Studies. Columbus:
Ohio State University Bureau of Educational Research, 1963.

Contains 3J23 titles adding SkZ new titles to the first edition
published in 1957. References are arranged under 26 categories
without annotation. It contains an author index.

70. Dale, E., and Reichert, D. Bibliography of Vocabulary Studies.
Columbus: Ohio State University Bureau of Educational Research,
1957.

The First Edition of the Ohio State Bibliographic Project,
superseded by the 1963 revision prepared by Dale and Razik.

71. Davies, A. (ed.) Language Testing Symposium: A Psycholinguistic 
Approach. London: Oxford University Press, 1968. 

This is a compilation of articles and studies on language 
testing, including an introduction and one chapter authored by 
the compiler, and an appendix on item analysis. There are 11 
articles in all, including the introduction, but excluding the 
appendix. The introduction covers language learning views and 
their influence on language testing, the uses of language tests, 
evaluation in language testing, language test analysis and Dr. 
Lado's approach to language testing. There are four main sections 
or groupings to the book after the introduction, although they 
are not listed as such: first section--evaluation, linguistics, 
and psychology (the basic disciplines and their relevance to 
language testing, chapters 2 to 4), second section--users and 
types of tests, chapters 5 to 8, third section--the influence 
of tests on education, chapters 9 to 11, and fourth section-- 
item analysis (the appendix). Two chapters are particularly 
relevant to spoken language teaching: chapter 7 on testing 
spoken language: some unsolved problems (G. E. Perren) and 
chapter 8 on the testing of oracy (skill in spoken language) 
(A. Wilkinson). 

72. Denes, P. B. On the statistics of spoken English. Journal of the 
Acoustical Society, 1963, 892-904. 

"A variety of statistical information about spoken English was 
obtained. The data are the results of analyzing a considerable 
body of conversational material and narrative taken from 
'Phonetic Readers'; the analyses were carried out by using a digital 
computer. The principles for selecting the speech material 
are discussed. Counts were obtained for the frequency of 
occurrence of phonemes, for the digram frequencies of phonemes, for 
word length, etc. Stress was taken into consideration, and many 
of the statistics were obtained separately for stressed and 
unstressed syllables. In addition, the frequency distribution 
of minimal pairs was obtained. Minimal pairs are the phoneme 
pairs that minimally distinguish one word from another. All 
results were evaluated from the articulatory point of view. It 
was found that, in spoken English, dental and alveolar 
articulations predominate and that manner rather than place of 
articulation is the dimension that carries by far the greatest 
functional load." 

73. DeVito, Joseph A. Comprehension factors in oral and written discourse 
of skilled communicators. Speech Monographs, 1965, 124-128. 

DeVito describes his work as "an attempt to compare written and 
oral samples of the work of skilled communicators for (1) 
over-all comprehensibility as measured by cloze procedure and (2) 
significant differences in selected elements supposedly related 
to ease of comprehension." Item two includes vocabulary measures 
of difficulty and diversity, sentence-level measures, and an 
examination of "density of ideas." 

74. DeVito, Joseph A. Psychogrammatical factors in oral and written 
discourse by skilled communicators. Speech Monographs, 1966, 33, 
73-76. 

"The concern of the present study, based on 18,000 words of oral 
and written discourse by skilled communicators, was with six 
psychogrammatical factors. Oral language was found to contain 
significantly more self-reference terms, pseudo-quantifying terms, 
allness terms, qualification terms and terms indicative of 
consciousness of projection than written language." 

75. DeVito, Joseph A. Levels of abstraction in spoken and written 
language. The Journal of Communication, 1967, 17, 354-361. 

"Samples of 8,000 words of oral and 8,000 words of written 
discourse, obtained from speech professors who had written 
extensively, were analyzed for the relative levels of 
abstraction. Oral language was found to be significantly less abstract 
and contained more finite verbs and fewer nouns of abstraction 
than written language." 

76. DeVito, Joseph A. A linguistic analysis of spoken and written 
language. Central States Speech Journal, 1967, 81-85. 

"Samples of spoken and written language obtained from professors 
of speech who had written extensively were analyzed for the 
frequency of the four major parts of speech and for two grammatical 
ratios which measure degree of qualification. Five of the six 
measures employed differentiated the two forms of discourse at 
statistically significant levels." The measure that failed to 
discriminate the two modes was the noun-verb to adjective-adverb 
ratio. 

77. Dewey, G. Relative Frequency of English Speech Sounds. Cambridge, 
Mass.: Harvard University Press, 1923 (1950 revision). 

Dewey's analysis of the relative frequency of English speech 
sounds was intended for application to shorthand, acoustic devices 
such as the telephone and phonographs, and to the study of 
language: change, history, and trends. This book investigates not 
only sounds (syllables) but also combinations of sounds (words). 
It contains a discussion of previous works in the field of 
quantitative analysis. The data base consisted of samples of written 
text drawn from newspapers, correspondence, novels, and other 
prose sources. Analysis of results revealed that: 9 words 
constitute 25 percent of the 100,000 running words (corpus), 69 
words constitute 50 percent of the 100,000, 732 words constitute 
75 percent of the 100,000, and 1027 words recurred more than 10 
times in the 100,000. 
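
Coverage figures of the kind reported in this annotation (so many words accounting for 25, 50, or 75 percent of the running words) can be derived from any frequency count. The sketch below is only an illustration under stated assumptions: it is not Dewey's procedure, the function name words_needed_for_coverage is hypothetical, and the input is assumed to be a list of already-tokenized words. It walks down the frequency list from the most frequent word and records how many distinct words are needed to reach each target share of the corpus.

```python
from collections import Counter


def words_needed_for_coverage(tokens, targets=(0.25, 0.50, 0.75)):
    """Map each target share of running words to the number of
    most-frequent distinct words needed to reach it."""
    counts = Counter(tokens)
    total = sum(counts.values())
    pending = sorted(targets)
    result = {}
    running = 0
    for n_words, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        # One very frequent word may satisfy several targets at once.
        while pending and running / total >= pending[0]:
            result[pending.pop(0)] = n_words
    return result


# Toy corpus (an assumption for illustration): 10 running words, 4 types.
tokens = ["the"] * 5 + ["of"] * 3 + ["cat"] + ["dog"]
coverage = words_needed_for_coverage(tokens)
```

In the toy corpus one word already covers half of the running words and two words cover three quarters. The same steep concentration, on a much larger scale, is what the 9, 69, and 732 word figures above describe.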

78. Dingwall, W. Transformational and Generative Grammar - A 
Bibliography. Washington, D.C.: Center for Applied Linguistics, 1965. 

The purpose was to compile as complete a bibliography as possible 
of linguistic rules that relate to sentences. There are two 
principal parts to this bibliography: published works: books 
and articles, and unpublished works: conference papers. The 
lists are principally the works of the following schools of 
transformational grammar: Z. S. Harris (University of 
Pennsylvania), N. A. Chomsky (MIT), R. E. Longacre (Summer Institute 
of Linguistics), and S. K. Shaumyan (USSR). 

79. Dixson, R. (ed.) Las 2000 palabras mas usadas con mas frecuencias 
en Ingles (The 2000 most used words with the greatest frequency in 
English). New York: Latin American Institute Press, Inc., 1956. 

The first 1,000 words follow the Thorndike-Lorge list. The 
second 1,000 words follow the Interim Report on Vocabulary 
Selection for teaching English as a foreign language (Palmer, 
Thorndike, West, Sapir, et al.) modified by current 
American-English usage. The word list is arranged alphabetically 
within groupings of 1 to 500, 501 to 1,000, and 1,001 to 2,000. 

80. Dolby, J. L., Resnikoff, H. L., and MacMurray, E. A tape dictionary 
for linguistic experiments. Proceedings of the Fall Joint Computer 
Conference 1963, Baltimore and London: 1963, 419-423. 

"A tape dictionary of some 75,000 entries has been prepared 
with part-of-speech, status, usage, graphemic syllabification 
and stress information. The entries have been sorted 
alphabetically forward and backward as well as by syllable and by 
part of speech. Comparisons are being drawn between various 
measures of usage as well as between two measures of 
the number of syllables in the written form. Considerable care 
has been taken to minimize the number of errors in the list and 
to insure a high degree of consistency in the coding. The 
authors believe that the resulting listing will be of great 
utility in basic studies of the nature of linguistic data 
handling." The project resulted in the production of: The 
English Word Speculum. 5 vols. Sunnyvale, Calif.: Lockheed 
Missiles and Space Company, 1964. 

81. Driemann, G. H. J. Differences between written and spoken language: 
an exploratory study. Acta Psychologica, 1963, 20, 36-57 and 78-100. 

The quantitative measures employed in this study include the 
total number of words in each sample, a classification of words 
by number of syllables, the verb-adjective ratio, and the 
type-token ratio. Texts from the writing and speech of eight 
psychology students were studied. 
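
The type-token ratio named in this annotation, and in several other entries, is the number of different words (types) divided by the number of running words (tokens). A minimal sketch follows, assuming the text has already been tokenized and normalized; the function name type_token_ratio is hypothetical and the example sentence is not from any of the studies listed.

```python
def type_token_ratio(tokens):
    """Number of distinct words divided by number of running words."""
    if not tokens:
        raise ValueError("type-token ratio is undefined for an empty sample")
    return len(set(tokens)) / len(tokens)


# Five running words, four of them different.
ttr = type_token_ratio("the cat saw the dog".split())
```

Because new types appear more and more slowly as a text grows, the ratio tends to fall with sample length. Comparisons between speakers or between speech and writing are therefore usually made on samples of equal size.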

82. Durr, William K. A computer study of high-frequency words in popular 
trade juveniles. A paper presented to The International Reading 
Association, Anaheim, Cal., May 6-9, 1970. 

"Word frequency was determined for library books that 
primary-grade children selected for free reading. A survey of librarians 
determined which books these children selected. This list was 
reduced to 80 books through evaluations by elementary school 
teachers. A computer analysis of each word in these books 
revealed 105,280 running words. When proper names, onomatopoeic 
words, and easily recognizable inflected forms and compounds 
were omitted, there were only 3,220 different words in all of 
these books. A frequency count of these different words 
revealed that just 10 words account for almost one-fourth of all 
running words, 25 words account for over one-third of all running 
words, and 188 words account for almost seven out of 10 of all 
running words. It was suggested that systematic teaching of 
these high-frequency words help insure that children have the 
background needed to read library materials of their own choosing 
at an early age. References and tables are given." 

83. Eastman, Carol M. The status of the reversive extension in modern 
Kenya coastal Swahili. Journal of African Languages, 1969, 
(Part 1), 29-39. 

This study is a fallout of a study on the Vumba, Amu, Bajuni 
and Jomvu dialects of Swahili conducted by the author in 1965-66. 
The procedure was to gather data in two-hour sessions. At 
least two informants were questioned individually for each 
dialect with respect to 51 verbs. For each one they were asked 
to supply sentences exemplifying its use. The 51 verbs were 
common ones having the largest number of extended forms (The 
Standard Swahili-English Dictionary, Oxford, 1939 was used). 
In a later phase of the study, verbs with commonly occurring 
radical final elements were extracted from the dictionary. Such 
radicals fell into the category of compound radicals. 

84. Eastman, Carol M. Markers in English-Influenced Swahili 
Conversation. (Papers in International Studies: African Series No. 8) 
Athens, Ohio: Ohio University Center for International Studies: 
African Program, 1970. 

This study examines one facet of changes in Swahili, the national 
language of Kenya and Tanzania (Tanganyika). It examines the 
use of markers as features of interference in Swahili-English 
bilingual conversation. These features involve the adoption 
of syntactic and semantic deviations in one language which can 
be attributed to the other. This paper demonstrates clearly 
how foreign words can be integrated into a language (Swahili) 
in an area where many people are bilingual (English-Swahili). 
Interference consists of simplification, lexical insertion of 
English words, English syntax incorporation, correction code 
switching (saying the same thing in both languages), and improper 
marker usage, i.e., as transitional utterances and oral pauses. 
Informants were two Tanzanians studying at a US university who 
were asked to talk about a variety of subjects as if they were 
in their own country. They used Swahili basically, but with 
considerable English intermixed. They actually talked on eight 
basic subjects. There were 503 utterances of varying length. 
Data were manipulated using a Burroughs 5500 Computer. The 
markers used were "nde", "nanhl", "kuma", and "tusema". (Markers 
are meaningless, so-called verbal pauses, hesitation words, 
such as "you know".) The study has an appendix containing 
English words used in the conversations. 

85. Eaton, Helen S. Semantic frequency list for English, French, German, 
and Spanish: a correlation of the first six thousand words in four 
single-language frequency lists. Chicago: 1940; and New York: Dover 
Publications, Inc., 1961. 

This book is an extension of earlier single language word counts 
across language divisions in an attempt to correlate the word 
frequencies of a group of languages in order to show an 
interlingual relationship among the concepts measurable by a scale of 
frequency of use. In a more specific sense, it is an attempt to 
establish through relative frequency of use the common conceptions 
of mankind as they find expression in its various languages; in 
this case, four West and Central European languages, many of which 
have spread worldwide. Words have both form and meaning, with 
meaning having the greater variation within any given language. 
Word counts, especially earlier ones, have tended to omit meaning 
(semantic value) and to concentrate on isolating the word forms 
most frequently used. Since a vocabulary of 500 words may 
represent a true vocabulary of 1500 or 2000 word meanings, knowing 
the form and a single meaning may not teach more than a small 
part of the living language. Dr. Eaton has tried to solve part 
of this problem in four languages with her comparative count, 
which includes a semantic as well as a form count of some 6,000 
basic concepts. The book is divided into an introduction, notes 
for the reader, and seven parts, one for each 1000 of the first 
6000 words, and one for the first part of the seventh thousand. 
These are followed by indexes to each of the word lists, an index 
of words deleted from prior English and German lists and those 
moved from one group of frequencies to another (Appendix I), and 
a conceptual analysis of substantives, verbs, and adjectives in 
the lists (Appendix II). Sources of the word lists were Thorndike's 
"Teachers' Word Book of 20,000 Words," Vander Beke's "French Word 
Book" (6,000 words), Kaeding's "Frequency Dictionary of the 
German Language" (80,000 words) and Buchanan's "Graded Spanish 
Word Book." A great advantage of this book is the careful recording 
of procedures and sources used. 

86. Edmundson, H. P. A statistician's view of linguistic models and language 
data processing. Natural Language and the Computer, ed. Paul L. Garvin, 
New York: McGraw-Hill, 1963, 151-179. 

A survey of mathematical models of linguistic features. 

87. Elderton, W. P. A few statistics on the length of English words. 
Journal of the Royal Statistical Society, 1949, 112, 436-445. 

This study examines a wide range of data in an attempt to 
determine the underlying laws of word-length distribution; Carlyle, 
Macaulay, Bacon, Scott, Swinburne, Johnson, Gibbon, Shakespeare, 
and The Bible provide the data. Further information is provided 
on the internal make-up of English words, including vowel and 
consonant distribution. 

88. Eldridge, R. C. Six thousand common English words, their comparative 
frequency, and what can be done with them. Niagara Falls, N.Y.: 
1911. 

A word count from newspaper prose; a sample of 43,939 words 
yielded 6,002 types. 

89. Ellegård, Alvar Statistical measurement of linguistic relationship. 
Language, 1959, 35, 151-156. 

"Linguistic relationship has been measured statistically by 
means of the product-moment correlation coefficient. The 
linguistic meaning of various forms of this coefficient is 
discussed on the basis of a simplified model. It is maintained 
that the most satisfactory statistic measures degree of 
correspondence or similarity rather than relationship in the genetic 
sense. When applied to Indo-European data, the statistic results 
in good agreement with common philological judgement. Problems 
of significance are discussed. Finally it is concluded that the 
statistical technique will both require and help to establish 
a taxonomy of languages." 
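
The product-moment correlation coefficient mentioned in this abstract is the ordinary Pearson r. The sketch below shows the computation only; how Ellegård coded his linguistic data is not stated in the annotation, so the presence/absence coding used here (1 if a language has a given feature or cognate, 0 if not), the two example vectors, and the function name product_moment are all assumptions made for illustration.

```python
import math


def product_moment(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical coding: one position per feature, 1 = present, 0 = absent.
language_a = [1, 0, 1, 1, 0, 0]
language_b = [1, 0, 1, 0, 0, 1]
r = product_moment(language_a, language_b)
```

With 0/1 data of this kind the coefficient reduces to the phi coefficient of the 2 x 2 table of shared and unshared features, which is consistent with the abstract's point that it measures correspondence or similarity rather than genetic relationship as such.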

90. Ellegård, Alvar Notes on the use of statistical methods in the 
studies of name vocabularies. Studia Neophilologica, 1958, 32, 214-231. 

This article discusses various statistical methods for describing 
the distribution of personal names in a given area and concludes 
that some common techniques cannot be employed with curtailed 
samples or used for comparing name populations of different 
sizes. He suggests that his remarks on name vocabularies apply 
to vocabulary studies in general. 

91. Ellegård, Alvar Estimating vocabulary size. Word, 1960, 16, 219-244. 

A discussion of the problems of determining vocabulary size 
from text samples. 

92. Ellegård, Alvar English, Latin, and morphemic analysis. Goteborg, 
Sweden: Elanders Boktryckeri Aktiebolag, 1967. 

This is a short discussion and analysis of Latin root words, 
inflections, prefixes, and suffixes in English, and the 
derivation of seven rules for recognition of morphemic elements in 
English, either in words or separately. 

93. Estoup, J. B. Gammes sténographiques. Paris: 1916 (4th edition). 

An early attempt to specify the rank-frequency relation of words 
in a text. 

94. Eyestone, Maynard M. Subordinate clauses in spoken and written 
American English. Dissertation Abstracts, 1967, 22, 3857A. 

The author analyses clause types and discusses them in a study 
of "...50,000 words of unrehearsed comments by American journalists 
and a like corpus taken at random from the published works of 
the same people." 

95. Fairbanks, Helen The quantitative differentiation of samples of 
spoken English. Psychological Monographs, 1944, 56, 19-30. 

Three-thousand word samples were taken from ten "superior" 
college freshmen and ten schizophrenics (whose case histories 
are described). The speech of each subject was recorded and 
transcribed by the author. Comparative data are provided with 
particular attention to type-token ratios, grammatical structures, 
and word frequencies. 

96. Flood, W. and West, M. Dictionary of Scientific and Technical Terms. 
London: Longmans, Green and Company, Ltd., 1960. (2d edition). 

This dictionary contains 10,000 scientific and technical words 
for the layman on 50 subjects. It explains the words with a 
vocabulary of 2,000 words, 56 of which are technical and 120 
more of which may be difficult for children or individuals who 
are not native English speakers. However, most of the 120 
words are explained in the dictionary itself. 

97. Fowler, Murray Herdan's statistical parameter and the frequency of 
English phonemes. Studies presented to Joshua Whatmough, ed. 
Ernst Pulgram, 's-Gravenhage: 1957, 47-52. 

This article examines the usefulness of Herdan's "coefficient 
of variation for the sampling distribution of means" in an 
examination of phoneme distribution in works by Graham Greene, 
Carl P. Boyer (a calculus textbook), and Beatrix Potter. 

98. Franklin, H., Meikle, H., and Strain, J. Vocabulary in Context, 
English Language Institute, University of Michigan Press, 1964. 

This book presents vocabulary in context in the order of 
attention pointers (lexical area), presentation (conversations in 
context), generalization (explanation or notes on the 
conversations), and practice (drill exercises in situational contexts). 

99. French Ministry of National Education Fundamental French (first 
level) (Le Français fondamental (1er degré)). Paris, France: National 
Pedagogical Institute, 1959 (2d edition). 

Fundamental French (1st Level) replaces Elementary French (1954), 
which was created in response to a request by UNESCO in 1947 
for a daily spoken language to enlarge the worldwide education 
base, and in response to a need felt by the French to compete 
with Basic English, while not imposing restrictions on growth 
which are inherent in Basic English. Fundamental French (1st 
Level) is meant to be the basis for textbooks on French vocabulary 
and grammar to be taught to foreigners as their first real 
introduction to the French language. It is based on French spoken 
in as natural a situation as possible. 163 conversations were 
recorded in the Paris area from a wide range of persons, for a 
total of 312,135 running words (tokens) of which 7,995 were 
different (types). From this base, a frequency list was 
prepared. It was found that some very useful words were used 
relatively infrequently in both spoken and written French. They 
were usually concrete words such as bus, stamps, and grocer. To 
avoid losing them, it was decided to classify by the term 
"availability", as well as "frequency". Words were listed by 
alphabetical order and then grouped by meaning. The lists were then 
cut by 100, to a figure of 800 words indicated by frequency as 
valuable; the words removed were close synonyms of others on the 
list, vulgar words, or words that presented some difficulty of use 
or learning. Some words were added to ensure all essential 
educative concepts would have a means of expression. The list has 
1445 items, of which 1176 are lexical and 269 are grammatical. For 
the grammar, constructions rare in spoken French were eliminated, 
such as some verb forms, interrogative expressions, and little used 
grammatical words. Both vocabulary and grammar are arranged by 
essential words and those of secondary priority. The vocabulary 
list is arranged in alphabetical order. Additionally, grammatical 
words, numbers, days of the week, months, seasons, and terms 
denoting human relationships are grouped separately. Frequency does 
not appear in the list in order not to influence teachers unduly. 
For certain classes such as vegetables, fruits, domesticated 
animals, metals, tools, and construction materials, only a minimum 
list was provided to allow for regional additions as required. 

100. French Ministry of National Education Le Français fondamental 
(deuxième degré). (Fundamental French, 2d level (stage)). (Brochure 707, 
E./SR) Paris: National Pedagogical Institute, (1963). 

This book extends the 1st stage of Fundamental French for those 
who desire to acquire a more complete knowledge of the French 
language and culture. It is based essentially on the written 
language enriched by more precise grammatical words and is able 
to express thoughts with greater consideration for the affective 
and cognitive nuances. It corresponds to the essential needs 
of the real world. This second stage is designed to assist the 
learner to read books, newspapers, and periodicals. The 
vocabulary includes words from the word list produced for the 1st 
level with frequencies equal to or greater than 20 (the 1st 
level included words only down the frequency scale as far as 29). 
It also includes words eliminated in the 1st level. It includes 
the remainder of the "available" words not included in the 1st 
stage. It includes new research--new study of varied types of 
printed texts to update the Vander Beke Dictionary (Vander Beke 
French Frequency Dictionary words with frequencies of 60 or above, 
of 1,147,748 running words, but based on written text... and old-- 
about 1900), resulting in k2S units of 500 words each not in the 
1st stage, retaining those with a frequency of 13 or greater; 
study of a terminal textbook on education in civics for the last 
part of primary education, retaining words with a frequency equal 
to or greater than 7; study of psychological vocabulary based on 
studies of 160 students at eight normal schools, retaining words 
used by at least 15 of the 160; additional words not indicated by 
frequency but judged by experts to be required. It has an 
alphabetically listed vocabulary. The grammar is extended beyond that 
of the 1st stage to include constructions required to read written 
material, but is still not a complete French grammar. It distinguishes 
(in the vocabulary) between words required for active use and those 
required only for understanding words when read or heard. 

101. French, Norman R., Carter, Jr., Charles W., and Koenig, Jr., Walter 
The words and sounds of telephone conversations. The Bell System 
Technical Journal, 1930, 9, 290-324. 

"This paper presents data concerning the vocabulary and the 
relative frequency of occurrence of the speech sounds of 
telephone conversation. Tables are given showing the most 
frequently used words, the syllabic structure of the words, the 
relative occurrences of the sounds, and, for each vowel, the 
percentage distribution of the consonants which precede and 
follow it. Comparisons are made with the vocabulary and relative 
occurrence of speech sounds in written English." 

102. Fries, C. The Structure of English. New York: Harcourt, Brace, and 
Company, Inc., 1952. 

This book is a continuation, extension, and expansion of Fries' 
"American English Grammar" (English Monograph No. 10 of the 
National Council of Teachers of English, 1940), with respect 
to the sentence. The materials for analysis were more than 
250,000 running words of standard English conversation recorded 
mechanically in Ann Arbor, Michigan. There were some 300 
informants who provided about 50 hours of diverse conversation. 
Emphasis is on the grammar of structure of oral English, as 
opposed to the grammar of usage based on differences of writing 
of socio-economic classes. This has led to the identification 
of patterns of oral English. Unfortunately, Dr. Fries has dwelt 
more on the analysis of his findings than on his procedures in 
arriving at them. The last chapter, 13, on practical applications 
has much that is of use in the teaching of English to those for 
whom it is not the native tongue. 

103. Fries, C. S., and Fries, A. C. Foundations of English Teaching. 
(The English Language Exploratory Committee) Tokyo, Japan: Kenkyusha, 
Ltd., 1961. 

This book is one which provides a basis for building textbooks 
and teacher's guides for teaching English in Japan, especially 
to the first three grades of the lower secondary schools in 
Japan. It contains structures (patterns) and vocabulary. It 
emphasizes dialogue as the form of teaching; i.e., of the 
structure and pattern of word usage in English. The essence 
of the procedure for vocabulary selection is in chapter 1 (The 
Nature and Function of a Corpus; with corpus being defined as 
vocabulary and structure of social situational frames). It does 
not give frequency counts, but supplies basic vocabulary and 
situational context for the use of words of the vocabulary. 

104. Fries, C., and Traver, A. English Word Lists - A Study of Their 
Adaptability for Instruction. Ann Arbor, Michigan: the George Wahr 
Publishing Company, 1950 and 1965. 

This work discusses English word lists, vocabularies, and the 
procedures for selection of vocabulary (frequency count, minimum 
basic lists, or psychological criteria). It is a critical inquiry 
into the character of the lists and their applicability to teaching 
English to non-English speaking learners. Specific discussions 
are included on the following: Ogden--Basic English, 
West--definition vocabulary, Palmer and Hornby (IRET)--standard English 
vocabulary, Thorndike--teachers' word books, Faucett, Palmer, West, 
and Thorndike--interim report on vocabulary selection, Faucett and 
Maki--1534 words and values of 1 to }kt and Aiken--Little English. 

105. Frumkina, R. M. Statisticeskie metody izucenija leksiki (Statistical 
methods of vocabulary study). Moscow: 

The author discusses both general problems and proposed models 
(e.g., Zipf's "law") of the statistical properties of the lexical 
structure of texts. Procedures for compiling a frequency 
dictionary are described. The text includes an appendix listing 
the most frequently used words in Puskin's lexicon, and the 
statistical properties of Puskin's texts are given with 
particular attention devoted to the type-token relationship. 

106. Frumkina, R. M. Allgemeine Probleme der Haeufigkeitswoerterbuecher. IRAL, 
1964, 2, 236-247. 

The author reviews the (then) recent word counts of Garcia Hoz and 
of Josselson, and proposes a method based on the Zipf function which 
will make it possible to compile a list with precision about a 
given percentage of the words in a text. First, a numerical 
estimate is made of the lowest frequency that in any particular list 
can be reached within a predetermined margin of error; then the 
size of the corpus is calculated which is necessary to determine 
the given frequency within the stated margin of error. Ms. Frumkina 
concludes with a list compiled according to her method which 
demonstrates its usefulness. 
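
The relation between a word's frequency, the tolerated margin of error, and the corpus size needed can be illustrated with a textbook sampling argument. The sketch below is not Frumkina's Zipf-based method, whose details the annotation does not give. It assumes instead that each running word is an independent draw, so that the count of a word with relative frequency p in N running words is binomial, and it uses the normal approximation: requiring z * sqrt(p(1 - p)/N) <= rel_error * p gives N >= z^2 (1 - p) / (rel_error^2 * p). The function name corpus_size_needed and its parameters are hypothetical.

```python
import math


def corpus_size_needed(p, rel_error, z=1.96):
    """Smallest corpus size N (running words) at which a word of relative
    frequency p is estimated within +/- rel_error * p, at the confidence
    level implied by z (1.96 is roughly 95 percent), under a binomial
    model with a normal approximation."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    if rel_error <= 0:
        raise ValueError("rel_error must be positive")
    return math.ceil(z * z * (1 - p) / (rel_error ** 2 * p))


# A word occurring once per 10,000 running words, estimated within
# 20 percent of its true frequency.
n_required = corpus_size_needed(p=0.0001, rel_error=0.2)
```

Under these assumptions a word occurring once in 10,000 running words calls for a corpus of a little under one million words, and the required size grows roughly in inverse proportion to the frequency. Words in real text are not independent draws, so figures of this kind are best read as lower bounds.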

107. Frumkina, R. M., A. P. Vasilevich and Y. N. Gerganov. Sub'ektivnye otsenki 
chastot elementov teksta i zritel'noe vospriyatie rechevoi informatsii 
(Subjective estimates of the frequencies of textual elements and the 
visual reception of spoken information). Nauchno-Tekhn. Infor. Prots. 
Sist., 1970, 2, 20-24. (In Russian) 

"A discussion of the results of an experimental testing of the 
following hypotheses: (1) the occurrence probability of 
meaningless letter combinations in speech predicts the threshold of 
their visual recognition; and (2) subjective estimates of 
probabilities of letter combinations as obtained by psychometric 
techniques are a stronger predictive factor than the estimates 
of the same probabilities obtained by text counts. Tests were 
made using Russian trigrams presented tachistoscopically. The 
results justify the assumption that prediction of an individual's 
behavior in a new situation is based on subjective estimates of 
probabilities of the situation structures." 

108. Fry, Dennis The Frequency of Occurrence of Speech Sounds in Southern 
English. Archives néerlandaises de phonétique expérimentale, 1947, 
20, 103-106. 

Fry examined a corpus of 17,000 sounds of Southern British 
English and provided frequency data based on the transcription 
system formulated by Daniel Jones. 

109. Fucks, Wilhelm On the mathematical analysis of style. Biometrika, 
1952, 39, 122-129. 

"Every significant text of a grammatical exposition consists of 
a certain material, the vocabulary, and some structural properties, 
the style, of its author. The passive vocabulary is formed by 
the totality of all words of that language, s, the author writes 
in, the active vocabulary is formed by a certain set, s', of 
that totality, the selection of which is determined essentially 
by the sort of literature the text belongs to and depends only 
in a lower degree on the peculiarity of the author. Style, 
however, is characteristic of the author at a certain period 
of his personal development. The aim of the following 
investigation is to formulate mathematically some of the properties of 
structure constituting style, so that for a given text the 
application of a simple mathematical criterion allows its 
attribution to a particular author at a certain period of his 
mental development." 

110. Fucks, Wilhelm Mathematical theory of word formation. Information 
Theory, ed. Colin Cherry, London: 1956, 154-170. 

The author hopes to discover "whether the process of word 
formation out of syllables in literary texts obeys a law which 
can be given mathematically." Word-length data from texts by 
Shakespeare, Aldous Huxley, Sallust, and Caesar are considered. 

111. Gammon, Edward R. A statistical study of English syntax. Proceedings 
of the Ninth International Congress of Linguists, ed. Horace G. Lunt, 
The Hague: 1964, 37-43. 

"This paper summarizes a statistical approach to English syntax. 
We show a segmentation of utterances based on the estimated 
sequence of forms of an utterance. We require that segment 
boundaries occur at positions in the sequence where the uncertainty 
in predicting possible future forms, given one or more immediate 
forms, is high. By 'high' we mean either in a relative sense, or 
larger than some prespecified value. The segments obtained from 
sequences of distribution classes coincide with recognizable 
phrases. Using various systems of phrase labeling, predictability 
of phrase types yields recognizable clauses and sentences; 
although these do not necessarily coincide with intonation patterns 
indicated by punctuation." 

112. Garcia Hoz, V. Vocabulario usual, vocabulario comun, y vocabulario
fundamental (Usual vocabulary, common vocabulary, and fundamental basic
vocabulary). Madrid: Consejo Superior de Investigaciones Cientificas
(Supreme Council for Scientific Investigations, Instituto San Jose de
Calasanz), 1953.

An interesting distinction is made between active vocabulary used
for speaking or writing and a passive or latent vocabulary used
for word recognition as in reading or listening. The sources of
the common vocabulary were private letters, periodicals, official,
religious, and trade union documents, and books. Note that they
are all printed or written sources. The book goes into considerable
philosophical discussion on the number of words in the corpus
needed to get a fair sample of different words for each type of source
of words and on the determination of what constitutes a "word".
As finally decided on, the usual vocabulary includes 12,428 words,
arranged in alphabetical order with frequencies listed for each
of the four sources as well as a total frequency. The usual vocabulary
is supplemented by 369 words of restricted usage, mainly
technical, and by 253 words which are not in the dictionary. The
"vocabulario comun" is more restricted. It consists of words
found in all four sources listed under the usual vocabulary. The
common vocabulary contains 1971 words and a supplement of words
which do not appear in all four types of sources, but still have
a total frequency of 40 or more, including some which reach 40 by
combination of related forms of a basic word. The fundamental or
basic vocabulary is the most restricted. The list consists of 208
words whose frequency is nearly equally distributed among all four
sources, provided that the total frequency is above 40. Some 26
words of high frequency (over ^lOO) were eliminated from the list
because of unbalanced frequencies with respect to letters (as a
source); 19 were too high and 27 were too low. In addition, there
are sections on correlation among the four sources, on factorial
analysis, and a conclusion.

113. Garvin, Paul (ed.) Natural Language and the Computer. New York:
McGraw-Hill, 1963.

This is a collection of 16 original essays concerning all phases
of computer-aided studies of language.

114. George, Alexander L. Quantitative and qualitative approaches to content
analysis. Trends in Content Analysis, ed. Ithiel de Sola Pool, Urbana:
University of Illinois Press, 1959, 7-32.

A survey of experimental design problems and methods of quantification.

115. George, H. V. An inventory of simple sentence patterns of English.
Proceedings of the Linguistic Society of New Zealand, 1967-1968, 10-11,
62-66.

A brief inventory is presented here. The author demonstrates the
possibility of constructing a language model rather than presenting
a comprehensive analysis of the English language. This
model, he asserts, would be of some value to teachers of English
and might serve as a tool for future research. He describes how
he constructed the model using 13 elements and first tried to organize
a system by hand. He found that to be of little value, so he
used a computer, which allowed him to list his elements and mutually
exclusive items. Then he requested permutations from the computer.
These results were then culled for items for which no examples could be
constructed. The remaining results were then listed and produced a
count of 518 patterns. The elements and codes were:

s.  Subject.
si. Subject formal it.
st. Subject formal there.
o.  Object.
p.  Predicative adjunct to the subject.
no. Not.
ne. Never.
f.  Finite verb, except am, are, is, was, were.
fb. Finite verb, am, are, is, was, were.
a.  Auxiliary except items of fb.
vs. Non-finite verb stem.
vd. Non-finite verb stem +ed.
vg. Non-finite verb stem +ing.

His 12 most frequent patterns were:

o.
p.
fp.
fo.
a.
no.
p no.
no p.
no o.
no fp.
no fo.
ne a vs p.

His 12 least frequent patterns were:

ne f st p.
st ne a.
st ne a vd.
a ne vd.
ne a st vd.
st a ne vd p.
p st ne a vd.
p st a ne vd.
a st ne vd p.
ne a st vd p.

The main problem with the work is that it does not present a comprehensive
analysis. The size of the corpus from which this
count was drawn is not mentioned and the number of sources used
is not discussed. The method of analysis is not explained deeply
enough. However, the author's goal appears to be to establish a
model for future research rather than a large analysis of the
language or languages in general.

116. Gibson, James W., Gruner, Charles R., Kibler, Robert J., and Kelly,
Francis J. A quantitative examination of differences and similarities
in written and spoken messages. Speech Monographs, 1966, 33.

This study examined the possible differences and similarities in
spoken and written style. The authors had 45 speech students
write essays and make speeches on given topics. Using
several methods of analysis, they concluded that spoken style
was more interesting and simpler to understand.

117. Gilmore, T., and Kwasa, S. Swahili Phrase Book for Travelers. New
York: Frederick Ungar Company, 1963.

A little broader in scope than most word and phrase books for
travelers, this book attempts to cover a broad area of dialectal
variation with a single word and phrase list of "essentials."

118. Good, I. J. Distribution of word frequencies. Nature, 1957, no.
4559, 595.

This is a very brief item on the relation between Zipf's
rank-frequency hypothesis and Shannon's entropy.

119. Gougenheim, G., Michea, R., Rivenc, P., and Sauvageot, A. Elaboration
on Fundamental French. Paris, France: Didier, 1964.

The introduction explains the reasons for Fundamental French
in some detail. Part 1, Chapter 1 provides a history of simplified
(basic) vocabularies. Chapter 2 discusses Basic English.
Chapter 3 discusses statistical methods of language analysis,
indicating their history and a preference for the statistical
methods over the logical/subjective methods. It refers to the
French frequency dictionaries by Henmon (1924; 400,000 written words)
and discusses them briefly. Chapter 4 discusses scholarly words
derived from the Vander Beke study, mainly Basic French Dictionaries.
Chapter 5 deals with two lists: (1) the Aristizabal list of 1938,
which was based on 1400 letters written by adults in several situations,
4100 compositions by children of various school grades,
and 25 stories invented and told by gifted children. The authors
came up with 460,727 running words containing 12,038 different
words, of which 4329 had a frequency of greater than 10. (2) The
Dottrens-Massarenti list, which was based on prior lists (Aristizabal,
Haygood, Vander Beke/Prescott) and on studies by Dottrens himself.
The final vocabulary count contains 2750 words arranged by frequency,
difficulty of spelling, and a quotient resulting from dividing
the frequency by the difficulty of spelling. The Appendix
to Part 1 refers to frequency counts in languages other than French
and English. Part 2, Chapter 1 describes in detail the method of
obtaining the samples of Fundamental French. It includes a list
of 1063 words in order of decreasing frequency. Next is the list
in alphabetical order. Chapter 2 contains studies on frequency/
grammatical relationships. Chapter 3 discusses relationships between
literary and spoken French and gives a table indicating rank
by frequency and a value for the Zipf constant (f x r^a), where
a = 1.305, on a corpus of 312,135 words, of which 7,995 are different.
Part 3 discusses the problem of availability versus frequency, the
available vocabulary and the degree of its availability, the psychological
stability of concrete words, sociological and geographical
differences, and complementary research studies. Part 4 contains
the vocabulary, including additions and deletions, notes on grammar,
and Verification Measures. The appendices include Extracts of
Recordings, Examples of Instructional Tests written in Fundamental
French, and a Bibliography.

120. Graham, E. Basic English: International Second Language. Orthological
Institute, New York: Harcourt, Brace, and World, Inc., 1968.

This work combines and updates Ogden's Basic English and the
ABC's of Basic English.

121. Green, J. R. A comparison of oral and written language: a quantitative
analysis of the structure and vocabulary of the oral
and written language of a group of college students. Dissertation
Abstracts, (1959), 19, 2080-2081.

"From the language data, tabulations were made of the number
of times each different word was used, the letter-lengths
of each word, frequencies of parts of speech, frequencies
of main and subordinate clauses, and frequencies
of various kinds of subordinate clauses. Verbid equivalents
of finite clauses were counted in a further study of ratio
of subordination." The material studied consisted of 13,684
words of speech and 18,447 words of writing.

122. Greenway, P. J. A Swahili-Botanical-English Dictionary of Plant
Names. Tanganyika (Tanzania): Dar-es-Salaam, 1940. (2d Revised
Edition).

The book lists botanical names in both Swahili-English and
English-Swahili orders. Brief descriptions of each plant,
tree, or shrub are provided as English translations for the
Swahili terms.

123. Gross, M. Mathematical Models in Linguistics. Englewood Cliffs,
New Jersey: Prentice-Hall, Inc., 1972.

Structural linguistics deals with the properties of natural
languages that are best accounted for in terms of combinations
of simple elements into more complex ones. There are
laws that restrict the combinations. In the last 20 years,
research in linguistics has reached the degree of complexity
and precision such that the use of mathematical tools has
become the only safe way to state the descriptions. This
book presents a number of such tools in terms of standard
mathematical notations. It also attempts, in a more general
way, to demonstrate how the tools can be used in linguistics,
especially in the construction of models.

124. Gruner, Charles R., Kibler, Robert J., and Gibson, James W. A
quantitative analysis of selected characteristics of oral and
written vocabularies. Journal of Communication, 1967, 17, 152-158.

"The purposes of this study were: (1) to develop a list of
the snty-five most frequently used words for both oral and
written messages; (2) to compare these word lists with similar
lists developed from previous research; and (3) to determine
the differences and similarities between written and
spoken vocabularies as measured by the type-token ratio."
Forty-five college students provided the data for the analysis.
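
The type-token ratio named in this abstract is the number of different
words (types) divided by the number of running words (tokens). A minimal
sketch in Python follows; the function name, the toy sentence, and the
whitespace tokenization are illustrative assumptions, not the authors'
procedure.

```python
def type_token_ratio(tokens):
    """Return the number of distinct word types divided by the number of tokens."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(set(tokens)) / len(tokens)


# Toy sample (hypothetical): 11 running words, 5 different words.
sample = "the cat saw the dog and the dog saw the cat".split()
print(round(type_token_ratio(sample), 3))
```

Because the ratio falls as a sample grows, comparisons between spoken and
written material are only meaningful for samples of equal length.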



125. Guilbert, Louis De l'utilisation de la statistique en lexicologie
appliquee. Etudes de Linguistique Appliquee, (1963), 2, 12-24.

This is a general consideration of the use of statistics in
language studies with particular attention to the problems
presented by idiomatic phrases, grammatical variants, and
semantic relations.

126. Guiraud, Pierre Bibliographie critique de la statistique linguistique.
Utrecht-Anvers: 1954.

A multilingual collection of books and articles arranged by 
content category. 

127. Guiraud, Pierre Les caracteres statistiques du vocabulaire. Paris:
1954.

More than half of the book is devoted to problems of the
analysis of lexical distribution in literary texts. The
concluding sections are devoted to a presentation of lexical
data derived from the study of poems by Baudelaire,
Rimbaud, Mallarme, Apollinaire, Valery, and Claudel.

128. Harwood, F. W., and Wright, A. M. Statistical study of English
word formation. Language, 1956, 32, 260-273.

"A quantitative study of English word formation based on the
data of the Thorndike-Lorge frequency list. Results cover
(a) dimensions of the word forming mechanisms in modern
English, (b) measures of the relations between major suffixes
and word classes, and (c) the main equivalencies symbolized
by the major suffixes."

129. Haydon, Rebecca E. The relative frequency of phonemes in General
American English. Word, (1950), 6, 217-223.

This article presents the results of the analysis of six
classroom lectures by different speakers transcribed in the
system developed by Kenneth L. Pike.

130. Hays, D. G. Introduction to computational linguistics. New York:
1967.

This is a textbook introduction with frequent illustrations
of computer algorithms for linguistics and exercises for the
student.

131. Henmon, V. A. C. A French word book, based on a count of 400,000
running words. (University of Wisconsin Bureau of Educational
Research Bulletin No. 3, September 1924.) Madison, Wisconsin:
University of Wisconsin College of Education, 1924.

This is a count of printed and written French as of that time,
i.e., prior to 1924. The study details the particulars of its
compilation. It was intended as a companion piece to Thorndike's
"The Teacher's Word Book" on English and Kaeding's
"Frequency Dictionary" ("Haeufigkeitswoerterbuch") on German.
The 400,000 running words were reduced to 9187 on a
dictionary basis. Of these, 3905 occurred five times or
more. They are printed in the book in order of frequency
(Part 1) and alphabetically (Part 2). Words occurring 5000
times or more account for 25% of running discourse. There
are only ten such words, but they include the verbs "to be"
and "to have" with all their conjugations subsumed under
the infinitive. 655 words occurred 50 times or more and
1250 (including the 655) occurred 25 times or more. Unfortunately,
Henmon did not give any details of the techniques
involved in the corpus selection; he only indicated its general
breakdown as to source. Nor did he indicate how he
developed and refined the word count. However, the book is
significant as one of the earlier word counts of some scope.

132. Herdan, G. A new derivation and interpretation of Yule's
'Characteristic' K. Journal of Applied Mathematics and Physics (ZAMP),
(1955), 6, 332-34.

"Yule's 'Characteristic' K of the word-frequency distribution
of a linguistic text is derived under the assumption
that the occurrence of a word in such texts was governed by
a law of chance (the Poisson law). This assumption, and
with it the use of K as a characteristic of the text, has
been attacked by linguists, without foundation, as the
writer believes. However, the constant K can be derived
without such an assumption, which has not only the advantage
of obviating adverse criticism of the kind referred to above,
but of showing K to be an easily interpretable, useful and
interesting characteristic of a linguistic text."
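
For orientation, Yule's K is commonly computed as
K = 10,000 x (S2 - N) / N^2, where N is the number of running words and
S2 is the sum of the squared frequencies of the different words. The
Python sketch below assumes that standard form; it is not taken from
Herdan's derivation, and the function name is hypothetical.

```python
from collections import Counter


def yules_k(tokens):
    """Yule's characteristic K = 10^4 * (sum of f^2 over word types - N) / N^2."""
    freqs = Counter(tokens)
    n = sum(freqs.values())          # N: number of running words
    if n == 0:
        raise ValueError("need at least one token")
    s2 = sum(f * f for f in freqs.values())
    return 1e4 * (s2 - n) / (n * n)


# A text in which no word repeats has K = 0; repetition raises K.
print(yules_k(["a", "b", "c", "d"]))
print(yules_k(["a", "a", "a", "b"]))
```

Unlike the type-token ratio, K is intended to be largely independent of
sample length, which is why it is often reported alongside that ratio.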

133. Herdan, G. The relation between the functional burdening of
phonemes and the frequency of occurrence. Language and Speech,
(1958), 1, 8-13.

"The frequency of occurrence of phonemes in a language may
be derived from dictionary material or from continuous texts.
This paper deals with the relation between the two sets of
values for English. When distributions are plotted for English
phonemes, classified according to manner and place of
articulation, it is seen that there is a close similarity
between the distribution for dictionary material and for
continuous texts. The hypothesis is advanced and tested
that the phoneme distribution in speech is a random sample
of the phoneme distribution in dictionary material (the
functional burdening of phonemes)."

134. Herdan, G. Quantitative Linguistics. Washington, D.C.: Butterworths,
1964.

The thesis of this book is that mathematical linguistics is
an integral part of linguistics, and not just some tool used
on an ad hoc basis to obtain statistical data. Herdan's
concept embraces and ties together de Saussure's and Bloomfield's
and differentiates among:

La Langue -- the language viewed as an entity,
Le Langage -- collective human speech as an entity (quantitative
              linguistics), which is somewhat
              different from La Langue, and
La Parole -- actual individual human speech or utterance,
             which differs from both of the above.

Herdan has divided his book into four parts and 20 chapters,
together with an appendix which provides a numerical table
of the law of solidarity (in the use of words it is the system
of vocabulary as revealed in the gradient of frequencies).
The four broad categories of analysis and exposition in Herdan's
book are: (1) quantitative linguistics, (2) phonemic
level, (3) vocabulary level, and (4) syntax level.

135. Herdan, G. The Advanced Theory of Language as Choice and Chance.
Berlin and New York: 1966.

The most recent and most comprehensive of Herdan's textbooks,
this work gives particular attention to matters related to
stylistics.

136. Hibbett, H. and Itasaka, Gen. Modern Japanese - A Basic Reader.
Cambridge, Massachusetts: Harvard University Press, 1965. Volumes
I and II.

Prepared with the assistance of an HEW (USOE) contract, the
two-volume text contains vocabulary lists and notes (Volume
I) and Japanese text (Volume II). This is a textbook which
should not be studied until after basic (or beginning) Japanese
has been mastered. It attempts to use the most frequently used
Japanese words as determined by the 1957 and 1960 vocabulary
studies by the Japanese National Language Research Institute.
Although it is in current or modern Japanese, it is oriented
toward the written or printed word rather than the spoken word.

137. Hill, Archibald Oral Approach to English. Tokyo, Japan: The
English Language Education Council, Inc., 1965. Volumes 1 and 2.

This book consists of progressive drills in spoken English
without reference to the origin of word selection. The drills
cover understanding sounds by contrast, producing sounds, use of
sentence patterns and sequences, substitution frames, dialogues,
and grammatical transformation practice.

138. Hill, L. Selected Articles on the Teaching of English as a
Second Language. London: Oxford University Press, 1967 (1969
reprint).

The author has spent most of his adult life teaching English
to foreigners in their own countries. He has also written
many articles on how to do it. He has gathered into this
compilation the eighteen he considers best. The
articles are full of useful hints on how to teach English,
in somewhat the same vein as Stevick's "Helping People to
Learn English."

139. Holstein, A. P. A statistical analysis of schizophrenic language:
preliminaries to a study. Statistical Methods in Linguistics,
1965, i, 10-14.

This article contains statistical summaries and a brief discussion
of "20 minute samples of the speech of 8 schizophrenic
patients" with comparable data from published counts. The
author calculates word class distribution, Yule's K, and the
type-token ratio and lists the most frequently used words.

140. Horn, E. A Basic Writing Vocabulary. (University of Iowa Monograph
in Education. First Series No. 4, April 1, 1926) Iowa City,
Iowa: College of Education, University of Iowa, 1926.

This is a vocabulary based on the 10,000 English words most
commonly used in writing. It contains an identification and
critical review of earlier writing vocabularies as well as
the methods used in developing the 10,000 word vocabulary.
The author points out the value of the list in teaching
English to foreigners since the first 500 words common to
his list, Thorndike's lists, and spoken vocabulary lists make
up 75-00% of the running words in English. This is a worthwhile,
albeit somewhat dated, study.

141. Horowitz, W. and Berkowitz, A. Structural advantage of the mechanism
of spoken expression as a factor in differences in spoken and
written expression. Perceptual and Motor Skills, (1964), 19, 619-625.

Type-token ratios were computed as part of this study of 
writing and speech. 

142. Horowitz, H. W. and Newman, J. B. Spoken and written expression:
an experimental analysis. Journal of Abnormal and Social Psychology,
1964, 68, 640-647.

"Two experiments were designed to test for the differences
between written and spoken expression. These two modes
were controlled by limiting time for the preparation, time
for exposition, and by limiting the subjects to two balanced
topics..." Type-token ratios were calculated among other
measures.

143. Howes, D. A word count of spoken English. Journal of Verbal Learning
and Verbal Behavior, 1966, (6), 572-606.

Howes undertook this research in order to update and correct
what he considered deficiencies in prior counts, especially
Thorndike's omission of spoken English; the French, Carter, and
Koenig Telephone Count (1930), which was designed to record speech
sounds rather than words, was not collected from running connected
samples, and used restricted sampling; and Fairbanks
(1944), which used a small corpus of 30,000 words and published
only those words with a frequency of 100 or more. Informants for
Howes' study were 20 sophomores at Northeastern University and MIT and
20 VA Hospital patients who had acted as controls for his prior
studies on aphasic speech but were themselves free from cerebral
defects or acute debilitating diseases. Informants were taped
in free speech in response to general questions designed to get
them talking naturally. All recordings took place between 1960
and 1965. There were 50 interviews of 5000 words each. The
41st (VA) informant provided 10 of the 50 interviews in order
to provide data on stability of word frequency. The 40 others
were each interviewed only once. The total corpus was 250,000
running words from 41 sources, which were catalogued as to individual
source as well as to class of source, i.e., University or
VA Hospital. There were 9699 different words in the corpus, of which
a little less than half (47 percent) occurred only once. The
author notes that the type/token ratio of spoken English tends
to be less than it would be in written or printed English, and
that only very large counts will produce evidence of extremely
rare words. (Bongers says at least a million. This count is
only 25 percent of that amount.) The results are tabulated in
an alphabetical list giving total frequency (all 41 informants)
and separately University (20) and VA Hospital (20) frequencies.
The informant interviewed 10 times appears only in the total
column. Words with a frequency of one are listed linearly to
save space, but are annotated to indicate whether they were
used by a University student, one of the 20 hospital patients, or
by the one VA patient interviewed 10 times. In spite of its
limited sampling, this is a useful count since it is recent and
embraced hospital patients from a variety of backgrounds (although
probably mainly from the lower middle and lower class) as well
as students, most of whom were probably in their 14th year of the
educational process. Within the student group, although Howes
did not break them out as such, he had some variety, at least of
academic interest, and probably also of socioeconomic background.

144. Hultzen, L., Allen, H., Jr., and Miron, M. Tables of Transitional
Frequencies of English Phonemes. Urbana, Illinois: University of
Illinois Press, 1964.

The probability of occurrence of a small unit in normal text,
taken without regard to context, may differ from its transitional
probability, that is, its probability when the preceding
unit is taken into consideration. A phoneme is defined
as the least unit for which a distinction must be made
in a language. Phonemic analysis yields more usable data
than analysis by spelling letters. The objective is to set
up an apparatus for describing the set of phoneme sequences
occurring in a running text of language. The corpus used
for the study was drawn from 11 different plays in the publication
"Plays - The Drama Magazine for Young People," published
by the Journal of Modern American English. Selections
were one page each. They were run together to obtain a total
of 20,032 phonemes, including junctures as phonemes. The
phonemic analysis follows that which Professor Agard used in the
Southwest Project in Comparative Psycholinguistics. In this
case, he spoke the selected excerpts of the 11 plays in his
modified version of the southeast New England dialect. The
phonemic notation was that used by Trager and Smith in their
Outline of English Structure. In presenting the tables and
corpus, an IBM printout was used with its limitation of capital
letters and a few non-literal symbols. Tabular displays
include the number of occurrences of single through four-phoneme
sequences. The fourth-order sequences are also tabulated
by reverse indexing. In Chapter 2, Section I, there are several
breakouts and elaborations on the tables in Part II, including
frequency by types and tokens. In Chapter 3, Section I, there
is a discussion of messages generated by computer on the basis
of transitional probabilities.
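
The kind of tabulation described in this entry (counts of one- through
four-unit sequences, and probabilities conditioned on the preceding units)
can be sketched as follows. This is an illustrative Python sketch on a toy
symbol string, not the authors' procedure or data; both function names
are hypothetical.

```python
from collections import Counter


def ngram_counts(symbols, max_order=4):
    """Count every contiguous sequence of 1..max_order symbols."""
    counts = {}
    for n in range(1, max_order + 1):
        counts[n] = Counter(
            tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)
        )
    return counts


def transition_probability(counts, context, nxt):
    """Estimate P(nxt | context) from the counts of sequences one unit longer.

    The context may be at most max_order - 1 symbols long.
    """
    context = tuple(context)
    longer = counts[len(context) + 1]
    # Occurrences of the context that are followed by some symbol.
    total = sum(c for gram, c in longer.items() if gram[:-1] == context)
    if total == 0:
        return 0.0
    return longer[context + (nxt,)] / total


# Toy "text" of four symbols (hypothetical, not real phonemic data).
counts = ngram_counts(list("abac"))
print(counts[2])                                   # two-symbol sequences
print(transition_probability(counts, ["a"], "b"))  # "a" is followed by "b" once out of twice
```

Comparing `counts[1]` with the conditional estimates shows directly how
far the frequency of a unit shifts once its predecessor is known.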

145. Ichiro, S. Basic Vocabulary for School Children (Kyoiku Kihon
Goi). Tokyo: Maki Shoten, 1958.

This vocabulary totals 22,500 words for use in the nine years
of compulsory education of Japanese children. It is graded
as follows:

Lower Primary  --  5,000 words,
Higher Primary --  7,000 words, and
Junior High    -- 10,000 words.

The words were selected on the basis of subjective criteria 
by a panel using dictionary sources. 

146. Jakobovits, L. A. Foreign Language Learning (A psycholinguistic
analysis of the issue). Rowley, Massachusetts: Newbury House,
1970.

This book is an attempt to unscramble some of the confusion
between language teaching (methods and materials) and language
learning (psycholinguistics and human variables). It
has five chapters:

1. Psycholinguistic Implications of Teaching of Foreign
   Languages,
2. Psychological and Physiological Aspects of Foreign
   Language Teaching,
3. Compensating Foreign Language Instruction (Teacher/
   Learner/Researcher/Evaluator),
4. Problems of Assessing Language Proficiency (see items
   on testing), and
5. Foreign Language Aptitude and Attitude References
   (Bibliography).

147. Johnson, D. B. Computer Frequency Control of Vocabulary in Language
Learning Materials. Instructional Science. Amsterdam, The Netherlands:
Elsevier Publishing Company, March 1972, (1), 121-131.

"Vocabulary is one of the major obstacles to attaining reading
fluency in a second language... For efficient learning, the vocabulary
systems must be structured in terms of frequency groupings
so that the more frequent ones are mastered before the less frequent
ones... The solution involves: (1) the establishment of
various word frequency groups and (2) marking the word in the
reading text so that the learner has a clear set of rational
priorities. Statistical studies suggest that approximately 5000
most frequent words constitute a minimum vocabulary for 'liberated'
reading and account for about 90 percent of the different words
in an average text... the presentation of the higher frequency
words within the 1000-5000 range should be sequenced by groups in
terms of their relative frequencies. Each group might correspond
to a particular level of language proficiency. This goal can
be attained by means of a system in which the frequency category
of each text word is marked so that the learner knows its relative
importance and can structure his vocabulary acquisition
accordingly. A marking procedure by frequency is integrated with
a marginal translation or glossing routine. The article proposes
a set of frequency groups and describes an algorithm for the implementation
of a frequency identification and marking procedure
on an IBM 360 computer..." Although the article is devoted to
reading skills, it has obvious application to oral vocabulary,
once determined, and its integration into oral sentence patterns
or other methods of learning conversation.

148. Johnson, F. A Standard Swahili-English Dictionary. (For the
Inter-territorial Language (Swahili) Committee), London: Oxford
University Press, 1939.

Madan's Dictionary (1903) was based on the language of 
Zanzibar City. This update broadens the geographic base 
of the word coverage. Nouns and other forms derived from 
verbs are listed under the verb rather than separately. 
The dictionary Includes loan-words from Persian, Hindi, 
Turkish, Arabic and neighboring Bantu languages, in addi- 
tion to a small number of Portuguese, German and English 
borrowings. (See Berritt, D. V.'s Dictionary.) 

149. Jones, L. V. and Wepman, J. M. A Spoken Word Count. (PHS Grant
MH 018^*9 and M-10006 (University of North Carolina). PHS Grant
MH 01876 (University of Chicago)). Chicago, Illinois: Language
Research Associates, 1966.

The vocabulary was compiled from English-speaking adults who
were each asked separately to tell a story about 20 pictures
in Murray's Thematic Apperception Test of 1943. It was discovered
that in spoken language, 33 words account for 50% of
the words used. (An analysis of the Lorge and Thorndike
Word Lists shows 89 words are required to account for 50%
of their written word sample.) The most frequently used
words are used more frequently by speakers than writers.
The book has three lists:

List A - 1102 words most often used by speakers; each
         word has a frequency of at least 4/100,000,
List B - Words spoken by at least two of the respondents,
         arranged by grammatical class, alphabetically
         within class, and
List C - List B in completely alphabetical order, and including
         inflectional forms.

A table shows the ratio of male/female usage of words as
well as the ratio of persons under/over 60 years of age,
recognizing that education probably has more to do with
the variance than the categories listed. Zipf states that
there is a relationship between word length and frequency
of use. This study supports his thesis up to words four
letters in length; after that the relationship is not exact.
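
A statement of the form "33 words account for 50% of the words used" can
be checked on any token list by sorting the different words by frequency
and accumulating until the requested share of running words is reached.
A minimal Python sketch follows, with a hypothetical function name and toy
data rather than the Jones and Wepman corpus.

```python
from collections import Counter


def types_needed_for_coverage(tokens, share):
    """Return how many of the most frequent word types are needed to
    account for at least `share` (0 < share <= 1) of all running words."""
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    freqs = sorted(Counter(tokens).values(), reverse=True)
    total = sum(freqs)
    running = 0
    for k, f in enumerate(freqs, start=1):
        running += f
        if running >= share * total:
            return k
    return len(freqs)  # only reached for an empty token list


# Toy sample (hypothetical): 10 running words, 3 different words.
toy = ["the"] * 5 + ["of"] * 3 + ["zebra"] * 2
print(types_needed_for_coverage(toy, 0.5))
```

Run on a spoken and a written sample of the same size, the function gives
a direct comparison of how heavily each mode leans on its commonest words.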

150. Jones, R. M. Situational vocabulary. IRAL, 1966, 4, 165-173.

Relating to the concept of selecting vocabulary according to
the idea of availability (disponibility), Jones discusses
objective means for selecting "centers of interest" by advancing
fairly rigorous definitions of "situation" or "center of interest"
and for using objective criteria to list the "centers of interest"
which are to be investigated. Jones discusses "open" and "closed"
situations, "positioned" and "unpositioned" situations, and recommends
the development of an "Aristotelian" hierarchy in classifying
vocabularies by situation; in effect, a situational
taxonomy.

151. Joos, Martin. Review of Zipf's The psycho-biology of language.
Language, 1936, 196-210.

A detailed critique of major significance. Joos proposes
a modification of Zipf's rank-frequency equation.

152. Joos, M. The English Verb Forms and Meaning. Madison, Wisconsin:
University of Wisconsin Press, 1968.

Joos says that if German is hard to learn because of its
nouns, English is hard to learn because of its verbs. He
divides his book into:

Chapter I    Introduction,
Chapter II   Non-Finite Verbs,
Chapter III  The Finite Schema,
Chapter IV   Basic Meaning and Voice,
Chapter V    Aspect, Tense, and Phase,
Chapter VI   Assertion, and
Appendices.



153. Jorden, E. The syntax of modern colloquial Japanese. Language,
31, No. 1 (Part 3), January-March 1955. New York: Kraus
Reprint Corporation, 1966.



The author states that her purpose is to give a systematic
and complete description of the syntax of modern colloquial
Japanese and incidentally to formulate a new technique for
analyzing language. The study is based on a corpus of
60,000 spoken words from the Tokyo area. Most informants
were men and women between the ages of 20 and 50, representing
varied professions and family backgrounds. All
were native speakers of Japanese, educated at least through
high school level. Topics talked on were anecdotes, personal
experiences, and conversations between individuals. Some
spontaneous speech heard in Tokyo was also recorded. Some
contemporary newspaper and magazine articles, some interviews,
round-table discussions, dialogues, and comic strips,
and some fiction were also added, so that in its entirety
the study was not completely of the spoken language. However,
the written material was recorded as spoken by an
informant. Material of a formal written style was omitted.
Utterances were broken down into successively smaller sequences
until the maximally independent (IC) sequence was
reached (Lexeme). All sequences were then categorized
(classified). The dissertation describes its method,
materials, procedures, the system of classification, and
its application. This study contains two appendices:

Lexeme Classes and
Constituent Types (of sequences).



154. Josselson, H. The Russian Word Count and Frequency Analysis
of Grammatical Categories of Standard Literary Russian. Detroit:
Wayne University Press, 1953.

This word count provides data dealing with the distribution
of vocabulary and structural categories of standard literary
Russian. The time-frame is the second quarter of the 19th
century to the present (circa 1950). The time samples were
taken as follows:

\S% -- 19th Century,
25% -- 1900-1918, and
50% -- 1918 to about 1950.

The classification of samples according to style is as
follows:

drama,
14% literary criticism,
20% journalism (wide scope within magazines and newspapers), and
59% fiction.



The material is condensed into six lists:

List 1 is the 20<» most frequently used words out of 150,000
running words.
Lists 2 through 5 are the first 2000 words in groups
of 500, arranged in alphabetical order, and
List 6 is the next 3,000 most important words for 3d
and later year students of Russian.

Tabulations do not include proper names, with some exceptions.
Inflected nouns are entered only as the masculine singular,
all inflected verbs are entered under the infinitive, dialectal
items are entered separately except for verbs, and dialectal
verbs are referred to their proper infinitive.

There is a tabulation of grammatical usage of several well-known
authors. Special computer source and punch cards
were prepared for essential data of the categories desired.

The total number of running words examined was 1,000,000
The total number counted was 526,0^^
The different words recorded were ^1,115
The total significant words published in the lists were 5,230

The final lists show range, frequency, chronology (period),
type of literature, and conversational or non-conversational
source.

Lists are given in order of range rank. The index is alphabetical,
with a List Key for each word.

155. Kaeding, F. W. Haeufigkeitswoerterbuch der deutschen Sprache
(Frequency Dictionary of the German Language). Berlin: Mittler
and Sohn, 1898.

This book is in German. It is a frequency count of words
and syllables in German and is one of the earliest still-cited
frequency counts. The book begins with a review
of literature on the subject, requirements for such a study
(especially for stenography), prior studies, lists of
source materials, and procedures followed. Some 11,000,000
words and 20,000,000 syllables were counted. Their ratio
in the study is then 1/1.83. There are several tables
which present material alphabetically and in frequency
rank order. In the main alphabetical table, inflections
are listed under the headword. This was a comprehensive
and thorough work for its time.

156. Karlgren, Hans. Positional models and empty positions. In
Structures and Quanta: Three Essays on Linguistic Description.
Copenhagen and New York: 1963, 22-56.

A discussion of the value of statistical considerations in
a slot-and-filler model of language.

157. Keil, Rolf-Dietrich. Einheitliche Methoden in der Lexikometrie. IRAL,
1965, 3, 95-122.

After discussing various problems associated with lexicology,
the author proposes that the corpus used should contain at
least ten million running words, with each single text containing
no fewer than ten thousand running words, and that the functional
weight given to text classes should correspond to the relative
importance of these classes in the language as a whole. An
extensive bibliography accompanies this article.

158. Kihouka, T. Japanese language guide for Secondary School Teachers.
South Orange, N.J.: Seton Hall University, 1964.

This guide has five parts: approach, planning, materials, when
to use Hiragana, Katakana and Kanji, and evaluation.







159. Kochi, D. Basic Japanese (Kiso Nippongo). Tokyo: Kokuseikan, 1933.

This is a Japanese parallel to Basic English by Ogden. The idea
was to streamline Japanese for instruction and especially to aid
in teaching Japanese to non-Japanese speakers. The book is divided
into three parts: sentence rules, a basic reader, and a
1000-word vocabulary.

160. Kochi, D. Shades of Japanese (Nippongo no sugata). Tokyo: Kaizosha,
19^1.

This book is a group of articles by the author, including one
called "Kisogo (Basic Japanese)". It also contains a basic word
list of 1100 words.

161. Koutsoudas, Andreas M., and Machol, Robert E. Frequency of occurrence
of words: a study of Zipf's law with application to mechanical translation.
Ann Arbor: University of Michigan Engineering Research Institute,
Report no. 2144-147-T, June 1957.

"Existing laws concerning the frequencies of words in language --
specifically Zipf's and Joos' laws -- are examined by means of new
formulas which permit comparison of these laws with easily obtainable
data. The laws are shown to be inaccurate and inadequate for
predicting the size of dictionary necessary for mechanical translation,
or the frequency with which words not in a dictionary of
given size will be found. It is concluded that an empirical
approach to this problem is most promising." Appendix A (pages
7-13) by George J. Minty summarizes the mathematical basis of
the new formulas.
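
The kind of prediction this report tests can be illustrated with the simplest textbook form of Zipf's law, in which the frequency of the word of rank r is proportional to 1/r over a finite vocabulary. The sketch below is an assumption-laden illustration only: it is not the set of formulas developed by Koutsoudas and Machol or summarized by Minty, and the function names and the exponent of 1 are chosen here for the example. Under that idealization, a dictionary holding the N most frequent of V word types covers H(N)/H(V) of running text, where H is the harmonic sum.

```python
def harmonic(n):
    """Harmonic sum H(n) = 1 + 1/2 + ... + 1/n (0 for n = 0)."""
    return sum(1.0 / r for r in range(1, n + 1))


def zipf_coverage(dictionary_size, vocabulary_size):
    """Fraction of running words covered by the top `dictionary_size`
    word types, ASSUMING an ideal Zipf law f(r) proportional to 1/r
    over a finite vocabulary of `vocabulary_size` types."""
    if vocabulary_size < 1:
        raise ValueError("vocabulary_size must be at least 1")
    if not 0 <= dictionary_size <= vocabulary_size:
        raise ValueError("dictionary_size must lie in [0, vocabulary_size]")
    return harmonic(dictionary_size) / harmonic(vocabulary_size)


def zipf_miss_rate(dictionary_size, vocabulary_size):
    """Predicted share of running words NOT found in the dictionary."""
    return 1.0 - zipf_coverage(dictionary_size, vocabulary_size)
```

Comparing such predicted miss rates with miss rates counted on real text is one way to see the inadequacy the authors report; the empirical count, not the idealized law, is what they recommend relying on.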

162. Kramsky, J. The frequency of articles in relation to style in English.
Prague studies in mathematical linguistics, 1967, 2, 89-95.

The investigation of the statistical distribution of definite,
indefinite, and zero articles in contemporary English reveals
that there are no significant differences in the usage of
articles in various styles.

163. Kraus, Jiri. K stylu soudobe ceske reklamy (On the style of contemporary
Czech advertising). Nase rec, 1965, 48, 193-198.

A statistical comparison of broadcast advertising with that of
newspapers, based on a part-of-speech count and a word repetition

164. Krishnamurthy, K. H. Psycholinguistic study of a schizophrene's speech.
Language and Speech, 1969, 12, 256-257.

"An analysis of a schizophrene's speech using a phonological
system of notation is presented here. Grouping the utterance
data into phonemic and non-phonemic phonatory, the latter in
turn into the normal and occasional etc., rather than phonemic
and prosodic, is shown to be more comprehensive and useful.
The system aims at incorporating many fresh utterance details
like stretches, response time, rate of phoneme production, tone-accent
distribution and the like in an edited transcript which
is also serially numbered in such a way as to help pinpoint
discussion of any portion. This is shown to be a useful method
of bringing out many features of psycholinguistic interest,
such as the general description of a subject's phonation for
comparative study, the richness and close correlation of the
devices to the mood and contents, etc. It also shows that
the way of using phonatory devices in active speech is more
varied than our native grammatical conceptions indicate and
includes illustrations of semantic incoherence and a thought-type
involved at many levels characterizing psychotic speech."

165. Kroeber, Karl. A computer analysis of fictional prose style.
Washington, D.C.: Office of Education, 1966.

"Fundamental characteristics of fictional prose style were studied
through systematic and objective analyses of novelistic syntax
and vocabulary. Sample passages from the major novels of Jane
Austen, the Bronte sisters, and George Eliot, as well as novels
by 13 other authors were analyzed. Information on sentences,
clauses, and words was coded and transferred to magnetic tape.
Statistical tests were run on the data, and frequencies of syntactic
patterns and vocabulary preferences were printed out. The
primary conclusions of the study were (1) it is not possible to
define the style of any novelist through simple statistical analysis
of his grammar or his word choice, (2) novelistic style can be
satisfactorily identified only in terms of multiple factors, many
of which go beyond the level of syntax and vocabulary, and (3)
further systematic study of fictional prose style should be based
on automated analysis of texts, as the human analysis of texts
requires an exorbitant amount of time."

166. Krohn, R. English sentence structure (sentence patterns). Ann Arbor:
English Language Institute, University of Michigan, University of
Michigan Press, 1971.

As the title implies, this book deals with patterns or frameworks
rather than frequency counts.

167. Kublin, H. Useful Japanese pronunciation and basic words. New York:
Japanese Society, 1961.

This booklet is very short and basic. It is for the traveler and
beginning student. It has two sections: Alphabetical Lists of
Words and Classified Vocabulary, e.g., Everyday Expressions and
Date-Time Expressions. There is no indication of how the words
and phrases were selected. It is essentially a short traveler's
word and phrase book.

168. Kucera, H., and Francis, W. Computational analysis of present day
American English. Providence, Rhode Island: Brown University Press,
1967.

This analysis was performed by computer on a nearly 1 million
word corpus of natural language text compiled in 1963-1964 at
Brown University. It contains both lexical and statistical
data. The purpose was to compile a corpus of printed American
English rather than to develop a basic vocabulary of most common
words. The corpus is divided into 500 samples of about
2000 words each from continuous discourse. All texts were
first printed in 1961 and represent a wide range of styles,
i.e., 15 categories: press, 3 (reporting, editorial, review),
religion, skills and hobbies, popular lore, literature and
biography, miscellaneous government documents, learned and
scientific, and fiction, 6 (general, mystery, detective, science,
adventure/western, romance, and love story). Samples were
randomly selected. The analysis is in two main parts: word
lists, and statistical tables and graphs. Word lists are: descending
order of frequency, alphabetical, first hundred most
frequent words by total and by the 15 categories, word frequency
distribution, and sentence length distribution (corpus as a
whole: 19.27 words; range: 25.^9 to 12.76 for Government Documents
(miscellaneous) and fiction/mystery, respectively).

169. Kucera, H., and Monroe, G. A comparative quantitative phonology of
Russian, Czech, and German. New York: American Elsevier Publishing
Company, Inc., 1968.

This is a book on computational linguistics financed, in part, by
the National Science Foundation, Institutional Grants and Facilities
Grants. The project reported on was designed to test the usefulness
of well defined quantitative procedures in phonological
analysis, especially comparative and typological studies, with
emphasis on the latter. The study explores, in addition to phonotactics,
the relative frequency of individual phonemes and phoneme
strings, probabilistic constraints on the occurrence of phonemes
in specified positions in relevant linguistic segments (i.e.,
syllables and words), and restrictions on sequences of larger
phonological units. The research is valuable to historical phonology,
revealing differences in historically related languages.
The basic mathematical procedure uses the concepts of information
theory. The first step was the phonemic analysis and transcription
of a significant body of data in three languages. The corpus
consisted of 100,000 phonemes for Russian and Czech and 105,17^
for German. Sources for printed texts of 20th Century authors
included 60 percent prose fiction, 20 percent journalistic press,
10 percent poetry, and 10 percent scientific and scholarly. The
data were placed on punch cards in standard spelling with Russian
transliterated into the Roman alphabet. An algorithm was constructed
to transform the graphic presentation into a phonemic one. After
a test, this part was done automatically. Some statistical counts
were performed along with the transcriptions in Russian and Czech.
The German text was pre-edited by separating prefixes from the item
by using hyphens. German transcription was semi-automatic. After
corrections, the statistical information was written onto magnetic
tape. Chapter 4 is devoted to defining the phonological syllable.
The three corpora were chosen to be comparable in content and style.
Calculations were performed to determine entropy and redundancy.
There is an isotropy index of two parts: isotropy proper or
phonotactics (matching phonemes in corresponding syllabic positions)
and isomorphy, the quantitative similarity of phonemes. The concept
of language divergence equals the difference between the actual isotropy
index and the maximum possible value of the index. This difference
turned out to be least between Russian and Czech, middle for Czech
and German, and greatest for Russian and German (as might be
expected). Conclusions: Close genetic relationships of two languages
are likely to be shown at the phonological level in similar
phonotactics, but not necessarily in very similar phonemic systems
(as Russian and Czech). Languages in close contact (as Czech and
German) may well show greatest similarity of phonemic inventory
but less in phonotactical or phonological levels.
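
The entropy and redundancy calculations mentioned in this annotation follow from the standard information-theory definitions. The sketch below uses those textbook definitions (Shannon entropy in bits per symbol, and redundancy as one minus the ratio of observed to maximum entropy); it is not taken from Kucera and Monroe, whose exact formulas, positional constraints, and isotropy index are not reproduced here. The function names and the use of single characters to stand in for phonemes are assumptions for the example.

```python
import math
from collections import Counter


def entropy_bits(symbols):
    """Shannon entropy H = -sum(p * log2(p)) of the symbol
    distribution, in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def redundancy(symbols):
    """Redundancy R = 1 - H / H_max, where H_max = log2(inventory size)
    is the entropy the same inventory would have if all symbols
    were equally frequent."""
    inventory = len(set(symbols))
    if inventory < 2:
        # With fewer than two distinct symbols H_max is 0; report 0.0.
        return 0.0
    return 1.0 - entropy_bits(symbols) / math.log2(inventory)
```

A transcribed corpus would be passed in as a sequence of phoneme symbols, one element per phoneme; a perfectly even distribution gives a redundancy of zero, and a more skewed one gives a value closer to one.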
170. Lachman, R. Lachman word count frequency table. New York: Department
of Psychology, State University of New York, October 1967 (a computer
readout).

It is based on a corpus of ^»65,^52 words, including punctuation.
There are 18^61 different words. In September 1965, 976 students
of both sexes took 40 minutes each to write on any subject except
psychology. There are two tables: alphabetical and frequency ranking.

171. Lado, R. Annotated bibliography for teachers of English as a foreign
language. Bulletin 1955, No. 3, US Department of Health, Education
and Welfare, USGPO 1955.

It contains material for the teacher, including tests and vocabularies
or word lists, and materials for students, with brief notes
about each item.

172. Lado, R., and Fries, C. C. English sentence patterns. Ann Arbor:
English Language Institute, University of Michigan Press, 1961, 1.

This book has for its purpose the understanding and production of
English grammatical structure by means of an oral approach. It
contains simple, intermediate, and advanced patterns. It is adaptable
for use with various levels of student ability. It states that
learning a new language consists not so much of learning about
the language as in developing a new set of (thinking) habits. It
has exercises for developing the required new "habits". Each
lesson has: an outline, a frame (including attention pointer,
structural pattern, and comments), illustrative examples, practice
exercises, notes, and a review.

173. Lado, R., and Fries, C. C. English pattern practice. Ann Arbor:
English Language Institute, University of Michigan Press, 1958, 2.

Supplements Volume 1 with practice material. Procedures are
entirely oral. The basis is a shift from mere imitation and
repetition of patterns, through conscious choice of elements of
structure to be learned, to exercises in which attention is
centered upon a variety of lexical meanings substitutable in the
structural frame. It is one of 3^ units of the intensive course
in English at Michigan. It is based on the idea that to learn
a new language one must orally establish the patterns of the language
as a subconscious habit. The pattern rather than particular
sentences is the target of learning, i.e., the significant framework.

174. Lamb, Sydney M. The digital computer as an aid in linguistics.
Language, 1961, 382-412.

A general introduction to computers and to problems solvable by
techniques of "mechanolinguistics". (Available in the Bobbs-Merrill
Reprint Series in Language and Linguistics, no."5?T

175. LeBreton, F. Up-country Swahili exercises (for the soldier, settler,
miner, and merchant and their wives). Richmond, Surrey, England: R. W.
Simpson and Company, Ltd., 1944.

This book tries to adapt the limited Swahili of the hinterland
for use by individuals who have to move inland. It is largely a
book of grammar, vocabulary and pronunciation. There is a special
vocabulary on military terms, a Swahili-English and an English-Swahili
vocabulary, and a key to the exercises.

176. Light, Richard L. A study of some factors involved in teaching technical
vocabulary to foreign military trainees learning English. Master's
Thesis in Applied Linguistics, Georgetown University, November
1964.

Ninety percent of the study group was foreign naval personnel
(FY 64). The audiolingual approach to language was employed.
Materials for teaching technical vocabulary in aural-oral
teaching situations were found to be lacking. The problems
were: (1) finding an important technical field common to
the majority of students (35 specialties involved), (2)
compiling the technical graded word list (criteria were word
frequency, word importance, safety, and the US Navy Word List),
(3) developing supplemental materials using pattern
practice for word recognition and structure patterns, and
(4) classroom trial of materials developed. Analysis
indicated that electrical terms were the most common
across the specialty fields. The criteria for construction of
the word list were word frequency counts and the occurrence
or non-occurrence of words in a US Navy List of electrical
terms. The thesis includes: Appendix A--Weighted Basic Terms
in Electricity, Appendix B--Alphabetical List of Vocabulary,
Appendix C--Lesson 1 (Patterns), Appendix D--Quiz with Pictures,
and a Bibliography.

177. Loogman, A. Swahili grammar and syntax. (Duquesne Studies--African
Series No. 1) Pittsburgh, Pennsylvania: Duquesne University Press,
1965.

The author developed this text from his experience gained from
37 years in Swahili-speaking Africa and the teaching of the language.
Part 1, Morphology, includes preliminary studies, nouns, qualifiers,
substitutes (pronouns), verbs, adverbs, prepositions and conjunctions,
ideophones, onomatopoeia, words, and interjections, and parsing.
Part 2, Syntax, includes sentences, nouns, qualifiers, substitutes,
verbs, binders, verb forms, auxiliaries, directive verbs, passive,
to be, and to have. A bibliography is also included. Part 2 of
this text is particularly valuable.

178. Loogman, A. Swahili readings. Pittsburgh, Pennsylvania: Duquesne
University Press, 1967.

The purpose of this text was to help the student advance his study
of the Swahili language from basic grammar and syntax to a profitable
contact with well-written Swahili in order to provide an opportunity
for observation, analysis and imitation. The materials
were selected from a wide range of subjects and types and include:
educational materials, histories, folklore, literary writing,
journalistic material, oratorical material, letter writing, and
poetry. The materials are divided into lessons, each of which is
accompanied by exercises in translating English into Swahili. A
key to the exercises is at the end of the book.

179. Lorge, I., and Thorndike, E. A semantic count of English words. New
York City: Institute of Educational Research, Teachers' College, 1938.

This is an account of the frequency of occurrence of each meaning
of each word, i.e., a semantic count based on 2,250,000 words and
the Thorndike 20,000 most common words (early version of the
30,000 word list). It is a hectograph reproduced in three-ring
binders by alphabetical groupings.

180. Lorge, I. The semantic word count of the 570 commonest English words.
New York City: Teachers' College, Columbia University, 1949.

This book contains the relative frequency of occurrence of the
different meanings of each of the most common words. It supplements
the Lorge and Thorndike List of 1938 for the 570 most common
words in English.




181. Mackey, W. F., and Savard, J. G. The indices of coverage: a new
dimension in lexicometrics. IRAL, 1967, 5, 71-121.

Describes research into the development of indices of coverage
or availability. The usefulness of a word is judged by the power
of the word to define, to extend its meaning, and to include or
to combine with other words. The authors conclude with a table
of 3,626 words arranged in decreasing order of index of coverage,
together with separate ratings for definitory, combinational,
inclusional, and extensional power.

182. Mackey, W. F., Savard, J. G., and Ardouin, P. Le Vocabulaire Disponible
du Francais (The vocabulary of available words of the French language).
(In French) Montreal: Didier, 1971, 1 & 2.

The purpose of Volume 1 is to document the differences and similarities
of concrete words used in France and in Acadia. The purpose
of Volume 2 is to concentrate in more detail on the concrete
words as used in Acadia, documenting the findings on Acadian children
according to age and considering the effects of bilingualism.
For Volume 1, word usage was tabulated in New Brunswick, Canada
and four regions of France. The sessions with the informants were
held from 1961-63, with the majority in 1962. In Canada, the
sessions centered around 22 areas of interest (27 for bilingual
children). The informants were 17^5 school children from ages
9-18, located in ^47 classrooms in 19 schools scattered throughout
New Brunswick. The total corpus numbered 900,000 words. Concrete
vocabulary was elicited by using as stimulus words the basic word
of the center of interest, such as "animal", "body", and "transportation".
Each child was given 15 minutes to write all the words
he knew related to the center of interest. Only 2-3 centers of
interest were covered at each session. In France, the informants
were about 700 school children ages 9-12 in about 20 classes
in as many schools; the total corpus numbered 300,000 concrete
words as derived from over 16 centers of interest. The indices
used to determine vocabulary were: frequency of use, distribution
(number of persons writing each word), and valence (the powers of a
word to combine into compounds or idioms, to act as a synonym for
another, to explain other words, and to express completely or
slightly different meanings). The running words in the corpora
were reduced to 10,000 different words, but these 10,000 were
expressed in some 6^,000 forms (i.e., the average word was
spelled in six different ways by the 2500 children involved). On
the average, the 10,000 words were used by at least 27 children.
However, closer analysis revealed that almost 5,000 were used by
only one child. That means that the common vocabulary is 5,000
concrete words or fewer, based on their use by more than one person.
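
Two of the indices named in this annotation, frequency of use and distribution (the number of persons writing each word), together with the final filter on words used by more than one person, can be sketched in a few lines. This is a hypothetical illustration rather than the authors' tabulation procedure: the function names, the data layout (one word list per informant), and the threshold parameter are assumptions, and the valence index is not modelled.

```python
from collections import Counter


def frequency_and_distribution(responses):
    """`responses` maps an informant id to the list of words that
    informant wrote. Returns (frequency, distribution) where
    frequency counts every occurrence and distribution counts
    the number of informants who wrote the word at least once."""
    frequency = Counter()
    distribution = Counter()
    for words in responses.values():
        frequency.update(words)
        distribution.update(set(words))  # one vote per informant
    return frequency, distribution


def common_vocabulary(distribution, min_informants=2):
    """Words written by at least `min_informants` different informants,
    in alphabetical order."""
    return sorted(w for w, n in distribution.items() if n >= min_informants)
```

Variant spellings would have to be normalized before counting, since the annotation reports that each word appeared in about six written forms on average.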
183. Malcolm, J. A classification of five thousand words most commonly used
in writing, as compiled by Dr. Ernest Horn, in accordance with the
principles of the new standard course--Pitman shorthand. Master's
Thesis, New York: Teachers' College, Columbia University, 1939.

This is a comparison of the Manual on "New Standard Course--
Pitman Shorthand" with "A Basic Writing Vocabulary of 10,000
Words" by Dr. Ernest Horn. Ms. Malcolm's analysis indicates
that the Pitman Manual required revision to make it conform to
actual word frequency usage as reflected in the Horn list.

184. Mandelbrot, Benoit. An informational theory of the statistical structure
of language. In Communication theory, ed. Willis Jackson, New York
and London: 1953, 486-502.

The author concludes his discussion of statistical models and
Saussurean linguistics with the observation that "a quite general
statistical structure, entirely independent of meaning, appears,
underlying meaningful written languages."

185. Marchand, H. The categories and types of present-day English
word-formation (a synchronic-diachronic approach). Wiesbaden, West Germany:
Otto Harrassowitz, 1960. (Also Auburn, Alabama: Alabama Linguistics and
Philosophical Series No. 13, University of Alabama Press, 1967.)

Although the author calls his approach synchronic-diachronic, he
starts off by emphasizing it is meant to be up-to-date, although
not all-inclusive, preferring general types of words to their
variations. Historical data on word changes is used only incidentally.
After the introduction, the author deals with compounds,
prefixation, and suffixation. He then proceeds in less detail
to cover zero-morphemes, back-derivation, phonetic symbolism,
ablaut and rime combinations, clipping (omitting part of the word
in speaking), blending, and word manufacture. This book presents
a comprehensive picture of the composition of English words.

186. Marchand, M. Five thousand French idioms. Paris: Em Terquem, 1910.

This is a book for advanced students who are learning French as
other than their native tongue. It includes Gallicisms, proverbs,
and idiomatic adverbs, adjectives, and comparisons. This edition
is a revision of an earlier work encompassing some 4,000 idioms.
The scope of the book embraces some 170 subject areas giving words
and expressions valuable to enlarging vocabularies of those who
already know some French. Unfortunately, the author does not
explain his sources very well and the book is not current by some
60 years.

187. Martin, S. E. Basic Japanese conversation dictionary. (Revised and
Enlarged) (English-Japanese and Japanese-English) Tokyo: Charles E.
Tuttle Co., 1963 (8th printing).

This Dictionary contains 3,000 "useful" English words with their
most frequent meanings and their Japanese equivalents. It is
meant for use with Martin's works on easy Japanese and essential
Japanese. Unfortunately, it gives no rationale for the selection
of the words it contains.

188. Martin, S. Morphophonemes of standard colloquial Japanese. Language,
New York: Kraus Reprint Corporation, 1966 (originally July-September
1952), 28 (3) (Part 2).

This study represents the first attempt to make a systematic study
of Japanese morphophonemes on a synchronic level. An attempt was
made to keep the analysis on a formal level, separate and distinct
from semantic correlations.

189. Martin, S. E. Easy Japanese--a direct approach to immediate conversation.
(3rd revised edition) Tokyo: Charles E. Tuttle Co., 1968
(17th printing).

This book has four parts: Say It with a Word or Two (Lessons 1-13),
Add a Bit of Action (Lessons 14-20), Sprinkle in a Few Particles
(Lessons 21-30), and 3000 Useful Japanese Words. The last is the
Japanese-English part of Martin's Basic Japanese Conversation Dictionary.

190. Maw, J. Sentences in Swahili--a study of their internal relationships.
London: Luzac and Company, Ltd., 1965.

This study was originally a Ph.D. Dissertation for the University
of London. The theory and method are those of Professor M.A.K.
Halliday. The materials were collected in 1964-1965 near Tanga,
Tanzania. They were almost entirely spoken and spontaneous. They
deal with units larger than the word from a syntactic rather than
from a morphologic point of view. The book does not carry the
study down into the structure of the word and morpheme. The study
material was taped in the form of conversations between native
speakers, stories, anecdotes, and discussions. Some expert
testimony of scholars was added to the field research.

191. Maw, J. Review of 'Swahili readings' by A. Loogman. Journal of
African Languages. Hertford, England, 1968, 2 (Part I).

Maw says that 'Swahili readings' is a collection of Swahili texts
from various sources and exemplifying different styles of Swahili
writings; some by natives, some not. Unfortunately, the author
has not indicated the source well enough to permit knowing which
author is a native speaker. The book was intended to help students
improve their Swahili. It failed in its purpose because Father
Loogman did not take the time to analyze the texts and derive
useful lessons and experience from them.

192. Mayaji, Hiroshi. A frequency dictionary of Japanese words. Dissertation
Abstracts, 1967, 22, 3^'»2^-^3A.

The dictionary is the result of a count of 250,000 words from
five writing types: fiction, drama, didactic prose, periodical
writing, and scientific writing.

193. McCalla, Gordon I. and Sampson, Jeffrey R. MUSE: A model to understand
simple English. Communications of the ACM, 1972, 15(1), 29-40.

"MUSE is a computer model for natural language processing, based
on a semantic memory network like that of Quillian's TLC. MUSE,
from a Model to Understand Simple English, processes English
sentences of unrestricted content but somewhat restricted format.
The model first applies syntactic analysis to eliminate some
interpretations and then employs a simplified semantic intersection
procedure to find a valid interpretation of the input.
While the semantic processing is similar to TLC's, the syntactic
component includes the early use of parse trees and special
purpose rules. The "relational triple" notation used during
interpretation of input is compatible with MUSE's memory structures,
allowing direct verification of familiar concepts and
the addition of new ones. MUSE also has a repertoire of actions,
which range from editing and reporting the contents of its own
memory to an indirect form of question answering. Examples are
presented to demonstrate how the model interprets text, resolves
ambiguities, adds information to memory, generalizes from examples,
and performs various actions."

194. McCarus, Ernest N. and Rammuny, Raji M. Word count of elementary modern
literary Arabic textbooks. University of Michigan, 1968.

"A computerized word count is presented of 11 elementary Modern
Literary Arabic textbooks used in the United States. The word
count was started in 1967 to provide a practical vocabulary base
for a fully-programmed self-instructional course on the phonology
and script of Modern Literary Arabic. The first part of the
count is a cumulative list (compiled with the aid of an IBM
360/20 computer and an IBM card sorter) arranged alphabetically
by Arabic root, according to conventional dictionary practice,
of all the words listed in the 11 Arabic texts, with their
English meanings. The sources for each word are given. The
number of different textbooks in which the word occurs is
indicated, as well as its frequency in the Landau word count
(1959). Plurals are listed separately but following the
singulars, and the imperfect tense of the verb is likewise
listed following the perfect, if both occur. Homonyms having
distinct plurals are listed as separate items. The second
part of the word count consists of alphabetical lists of the
words occurring in all 11 textbooks, in 10 of them, in 9,
and so on. A list of the 11 textbooks covered in the count
is also included."

195. McGovern, W. Colloquial Japanese. London: Routledge and Kegan-Paul,
Ltd., 1968.

This is essentially a Japanese grammar based on British experience
in training naval personnel in intensive Japanese classes. The
student gets a general survey of the scope of the language before
being introduced to the details with graded exercises. It contains
a Japanese-English vocabulary.

196. Meier, Helmut. Deutsche Sprachstatistik. Hildesheim: 1964.

The book provides numerous data on German (esp. phoneme and word
statistics) based on Kaeding's frequency dictionary and on a variety
of continuous texts representing different styles.

197. Milic, Louis T. Style and stylistics: an analytical bibliography.
New York and London: 1967.

Some eight hundred items devoted to stylistics, arranged chronolog-
ically in five sections: Theoretical, Methodological, Applied,
Bibliographies, and Omnibus Works. Items are annotated and indexed
by subject and topic.

198. Miller, G. A. Language and communication. (revised edition) New York:
1963.

Chapter 4, "The Statistical Approach" (pages 80-99), gives a survey
of major studies in statistical linguistics and introduces the
student to the basic problems of the field.

199. Miller, G. A., Newman, E. B., and Friedman, E. A. Length-frequency
statistics for written English. Information and Control. 1958, 1,
370-389.

"The results of a tabulation of word frequencies in a sample of
written English are analyzed in terms of word length and syntactic
function. It is found that a simple stochastic model gives a rough
prediction for the results obtained when all words are combined,
but not when words are classified as function or content words.
Function words are short and their frequency of occurrence is a
decreasing function of their length; content words are longer and
their probability is relatively independent of length." Zipf's
and Mandelbrot's "laws" are discussed.

200. Moore, W., and Ogawa, Y. 400 sentence patterns with creative sentence
patterns (English). Japan: Hosei University Press, 1954.

This text, designed for use by Japanese students of English, has
three parts. Central Problems of Grammar gives 40 fundamental or ele-
mentary English sentence patterns. For each pattern there is a
creative sentence pattern designed to stimulate use of the basic
pattern. There is a vocabulary of families of words for varying
the word use in the sentence patterns. Also included are
pronouns, the verbs "to be" and "to have", auxiliary verbs, and
the present and present progressive tenses. Intermediate Patterns
covers additional tenses of verbs, conjunctions, idiomatic
expressions, and additional families of words. Advanced Patterns
covers additional conjunctions, gerunds, quotations, relatives,
infinitives, and subjunctives.

201. Morgan, B.Q. German Frequency Word Book. (American and Canadian Committees
on Modern Languages) New York: The Macmillan Company, 1931, 9.

This study revises Kaeding's Haeufigkeitswoerterbuch der Deutschen
Sprache (see above) and uses its findings in the construction of
a German vocabulary for teaching purposes. To correct Kaeding's
work, Morgan reduced the words to their stems, and he describes
the system he used to accomplish that. The author admits that
there are limitations to the study in terms of its age (1898),
but he feels that Kaeding's wide use of sources and his large
number of running words justify its use. The author also pres-
ents two word lists which were the results of his study. The
first list shows the basic words he derived by using stem words
and those which had a frequency of 200 or more. The second
list is an alphabetic list of the words with their frequencies.

202. Muller, Charles Le mot, unite de texte et unite de lexique en statis-
tique lexicologique. Travaux de linguistique et de litterature. 1963,
1, 155-173.

A detailed discussion of the problems of defining the "word" for
lexicographical and statistical purposes.

203. Muller, Charles Frequence, dispersion et usage: a propos des diction-
naires de frequence. Cahiers de Lexicologie, 1965, 7, 33-42.

This article argues that both frequency and dispersion must
be taken into account in the preparation of word lists for
language teaching. The Frequency Dictionary of Spanish Words
(Juilland and Rodriguez) is discussed in some detail from
this point of view.
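
The interplay of frequency and dispersion can be made concrete with a small
calculation. The sketch below is an illustration only and is not taken from
Muller's article. It assumes a corpus divided into equal-sized subcorpora and
uses the dispersion coefficient D = 1 - V / sqrt(n - 1) and the usage
coefficient U = F x D in the form commonly attributed to the Juilland
frequency dictionaries. The function names are hypothetical.

```python
import math


def juilland_dispersion(subfreqs):
    """Dispersion D of one word over n equal-sized subcorpora (assumed form).

    D = 1 - V / sqrt(n - 1), where V is the coefficient of variation
    (population standard deviation divided by the mean) of the word's
    frequencies in the subcorpora. D is 1 for a perfectly even spread
    and 0 when every occurrence falls in a single subcorpus.
    """
    n = len(subfreqs)
    if n < 2:
        raise ValueError("at least two subcorpora are needed")
    mean = sum(subfreqs) / n
    if mean == 0:
        return 0.0  # the word never occurs; treat it as undispersed
    variance = sum((f - mean) ** 2 for f in subfreqs) / n
    v = math.sqrt(variance) / mean
    return 1 - v / math.sqrt(n - 1)


def usage_coefficient(subfreqs):
    """Usage U = total frequency F weighted by dispersion D."""
    return sum(subfreqs) * juilland_dispersion(subfreqs)


# Two words with the same total frequency (50) but different spread.
even_word = [10, 10, 10, 10, 10]
clumped_word = [50, 0, 0, 0, 0]
print(juilland_dispersion(even_word), usage_coefficient(even_word))
print(juilland_dispersion(clumped_word), usage_coefficient(clumped_word))
```

A raw frequency count cannot tell the two words apart, whereas the usage
value separates them, which is the kind of distinction the article asks
compilers of teaching lists to respect.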

204. Muller, Charles, Frequence des signifies ou frequence des signifiants.
Etudes de Linguistique Appliquee. (In French) 1971, 1, 74-87.

"A comparison of the frequencies of French words with those
of Spanish words having the same semantic contents, according
to "The Romance languages and their structures" by S. Juilland.
Passing from the total of frequencies to the fundamental data
provided by each word, a correlation is revealed between the
frequencies observed in the two languages. It appears obvious
that the frequency of words depends on the stylistic situation
represented by the categories of texts used for each group of
words. The probability of using a lexical element is deter-
mined much more by the situation than by the lexical structure
of the language, and frequency is related to the signifiant
as much as to the signifie."

205. National Institute of Health Seminar on computational linguistics.
(Public Health Service Publication 1716) Washington, D.C.: Department
of HEW, October 1966.

This is a report of a seminar among linguists and National
Health Service personnel. There were 13 presentations.
Most deal with machine analysis of language, with emphasis
primarily on syntax and secondarily on semantic meaning. It
is a valuable document in revealing trends in linguistics,
particularly that of syntax versus phonology, the increased
attention being given to semantics, and the use of computer
assistance in language studies.

206. The National Language Research Institute (of Japan) A research of news-
paper vocabulary. Tokyo: Yoyuya, Sinzyuku, 1952 (in Japanese).

The main parts of this report are an outline and scope; lists
of words used in a month in a newspaper, including words
used more than 10 times and words used more than 100 times
listed in order of frequency; and an analysis (frequency of
words by day, frequency by article, news item, and classifi-
cation by parts of speech). These words are listed in
Japanese alphabetical order. Although useful, this research
suffers from being more in depth than breadth, i.e., only
one newspaper was used. It is, however, reasonably current,
having been conducted in 1951-1952.

ulary (Gendaigo no goi chosa) Tokyo: The National Language Research
Institute (Kokuritsu Kokugo Kenkyujo), [Part 1 (1953) & Part 2 (1958)].

Part 1, Research on Vocabulary in Women's Magazines (Fujin
Zasshi no Yogo), was based on sampling the text of one year's
issues of two representative women's magazines (3 million
running words). Part 2, Research on Vocabulary in Cultural
Reviews (Sogo Zasshi no Yogo), was based on a sampling of
13 cultural reviews (230,000 running words). About 4000
most frequently used words are listed in each case. The
analyses consider mainly the statistical and semantic struc-
tures of the vocabulary and word construction. Procedures
used are spelled out in detail. Much use is made of sta-
tistical sampling, as opposed to word count methods as used
by Thorndike and Horn.

208. The National Language Research Institute (of Japan) Research on the
vocabulary in a newspaper in the early years of the Meiji Period
(1877-1878). Tokyo: Kanda-Hitotubashi, Tiyoda, 1959.

There are five main parts to this study: an outline (and scope),
procedures, tables (vocabulary; high and low frequency words,
supplemental words, prefixes, and suffixes), analysis (symbol
combinations; words of three Chinese characters style and vocab-
ulary), and an appendix, including technical terms used. This is
an interesting study although obviously dated.

209. The National Language Research Institute (of Japan) The use of written
forms in Japanese cultural reviews. Tokyo: Kanda-Hitotubashi, Tiyoda,
1960.

This report of research contains two main parts: an outline
(and scope) and lists and tables, including a list of words with
two or more variants, a table of frequency distribution of
Chinese characters, and frequency tables of Chinese characters
with frequencies of one or more with their different meanings.
At the end of this list is a supplemental list of 1850
Chinese characters in official use in Japan, the frequency
of Chinese characters not in the official list, and a list
of such characters.

210. The National Language Research Institute (of Japan) Vocabulary and
Chinese characters in ninety magazines of today. Tokyo: National
Language Research Institute, 1962.

This study is in three volumes. Volume 1 is a general
description of the project and vocabulary frequency tables.
The samples are dated 1956. Fields covered include culture,
business, popular science, housekeeping, sports, and other
amusements. The sample used contained 540,000 words from a
possible 140 million. After an introduction, the analysis
is tabulated in a series of tables: 7200 Most Frequent Words
in Alphabetical Order with Their Relative Frequencies, 7200
Most Frequent Words Arranged in Order of Frequency, Frequency
Tables in Five Strata by Class of Magazine from which sub-
samples were taken, and Bound-Form Frequency Tables. An appendix
is included giving a justification and procedures followed.
Volume 2, Chinese Character Frequency Tables, 1963, after an
introduction, consists of a series of tables: Most Frequently
Used 1995 Chinese Characters According to Relative Frequencies,
Most Frequent 1995 Chinese Characters with their Different
Meanings and Uses, and 3328 Chinese Characters used in Japanese
arranged in their (Japanese) Alphabetical Order. Volume 3,
Analysis of Results, 1964, is arranged under the following
headings: tables containing the 1200 most frequent fundamental
words and the semantic classification of the 700 most funda-
mental words, statistical structure of vocabulary, usage of
bound forms with frequency tables, means, and uses as pause
groups or markers, an analysis of 4,381 compound words, and dis-
cussion of formally similar words as different or same words,
using a 974 word list and two approaches. This is a current
analysis of part of the printed Japanese language.

211. Newman, Edwin B., and Waugh, Nancy C. The redundancy of texts in
three languages. Information and Control. 1960, 3, 141-153.

"The procedure that predicts the mean information per letter
in a long text by adding the constraint measured between
pairs of letters in a text has been tested more fully. Re-
sults are presented to show that with randomized texts there
is a close approximation to the Miller-Madow prediction of
simple bias. Three samples of English of varying complexity
show slightly more information per single letter and much
more information in an average letter for the more difficult
material. Conversely, samples for Samoan, English, and
Russian show some constancy in the average information per
letter in spite of wide differences in the size of their alpha-
bets. Thus, greater redundancy is correlated with a larger
alphabet." The three samples of English considered are from
the Bible, William James, and the Atlantic Monthly.
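
The quantities named in this abstract can be illustrated with a short
calculation. The sketch below is not the authors' procedure, which also adds
the constraint measured between pairs of letters. It covers only the
single-letter case: the entropy of the letter distribution in bits, the
Miller-Madow correction for sampling bias in its standard form, and
redundancy relative to a uniform alphabet. The function names and the toy
text are assumptions made for illustration.

```python
import math
from collections import Counter


def letter_counts(text):
    """Count alphabetic characters, ignoring case, spaces, and punctuation."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


def entropy_bits(counts):
    """Plug-in estimate of information per letter, in bits."""
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no letters in text")
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def miller_madow_bits(counts):
    """Plug-in entropy plus the Miller-Madow bias correction.

    The correction is (m - 1) / (2 N) nats, where m is the number of
    letters actually observed and N the sample size; dividing by ln 2
    converts it to bits.
    """
    n = sum(counts.values())
    observed = len(counts)
    return entropy_bits(counts) + (observed - 1) / (2 * n * math.log(2))


def redundancy(counts, alphabet_size):
    """Redundancy relative to an alphabet of equally likely letters."""
    return 1 - entropy_bits(counts) / math.log2(alphabet_size)


counts = letter_counts("abab")
print(entropy_bits(counts))                  # information per letter
print(miller_madow_bits(counts))             # bias-corrected estimate
print(redundancy(counts, alphabet_size=4))   # share of capacity unused
```

With a fixed information per letter, a larger alphabet gives a larger
maximum and therefore a larger redundancy, which is the relation the
abstract reports across Samoan, English, and Russian.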

212. Nice, Margaret Morse On the size of vocabularies. American Speech.
1926, 2, 1-7.

This is a general consideration of the problem of determining the
extent of an individual's vocabulary.

213. Nisbet, J. D. Frequency counts and their uses. Educational Research.
1960, 3, 51-64.

The author focuses on the history of vocabulary counts and
concludes that their value may not be as great as is usually
supposed.

214. Oettinger, Anthony G. Linguistics and mathematics. Studies presented
to Joshua Whatmough. ed. Ernest Pulgram, 's-Gravenhage: 1957, 179-186.

A discussion of the notion of "model" in mathematical linguistics
with particular attention to those proposed by Condon, Zipf, and
Mandelbrot.

215. Ogden, C. The general basic English dictionary. London: Evans Brothers,
Ltd., 1960.

This volume uses the 850 basic words and 50 additional inter-
national words to explain 40,000 meanings of 20,000 English
words.

216. Palmer, H. A grammar of English words. London: Longmans, Green, and
Scott, Ltd., 1938 (1967 edition).

This book would be more properly entitled "A Grammatical Dic-
tionary of English Words". The author gives 10,000 English
words with their pronunciation, information on their several
meanings, their inflections and derivatives, and the context
in which each word appears (collocations and phrases). There
are four appendices: verb patterns, important grammatical
categories, measures of time, and irregular inflections.



217. Palmer, H. E., and Hornby, A. S. Thousand word English. London:
George G. Harrap and Company, Ltd., 1937.

The authors divide this short book into two parts: an introduction
in which they discuss how the words were selected [a combina-
tion of methods: subjective, objective (quantitative), and
empirical], and the vocabulary itself, including inflected forms
(which raise the total real-world words to well over 1000).
This is an interesting work on vocabulary selection by the use
of prior studies revised in the light of experience and person-
al judgment.



218. Perrott, D. V. Concise Swahili and English Dictionary. London:
English Universities Press (EUP), 1970.

This dictionary starts off with a concise grammar from her
"Teach Yourself Swahili" book. It is followed by two sections,
Swahili-English and English-Swahili, both with notes. The
dictionary contains all the words heard by the author during
her 30 years in East Africa, plus a selection of words from
Krapf, Sacleux, and Madan. The loan-words given are mostly
from Arabic and Hindi, some from Portuguese and German,
and a large number from English. In this latter respect it
differs from the Johnson Dictionary.



219. Petty, W. T., Herold, C. D., and Stoll, E. The state of knowledge
about teaching vocabulary. (Cooperative Research Project No. 3128,
Contract OE 6-10-120) Champaign, Illinois: National Council of Teachers
US Office of Education, 1968.

The focus of the project is on the teaching of vocabulary
rather than on developing it. It is also pitched towards
native speakers of English. Types of vocabulary are classed by
form (words or phrases) and type (speaking, listening, reading,
or writing). Other subdivisions are formal, informal, or
colloquial. It advises that the teacher decide first on the
vocabulary to teach, then on the aspects of language such
as grammar, phonology, semantics, and situations of verbal
contexts. Chapter 5 discusses research design for vocabulary
studies; a type vocabulary; functions of vocabulary, and a
sample population.

220. Pfeffer, A. Index of English equivalents for the basic (spoken)
German word list. (Grundstufe - 1st Stage) Englewood Cliffs, N. J.:
Prentice-Hall, 1964.

This book contains the English equivalents of the meanings
of the basic (spoken) German. The procedure Pfeffer used
paralleled that of Lorge and Thorndike in prorating the
relative frequency of a particular meaning to the frequency
of occurrence of the word. Computer assistance was used
where appropriate. The corpus of the semantic count was
derived from taped interviews. Both frequency and range
were listed in the semantic frequency count as well as in
the original frequency count. (Range is the number of
speakers who used the word as compared to the total number
of speakers contributing to the sample.) There were some
shifts in words from the basic count because of semantic
importance, and 355 subsidiary word forms were added (16
nouns, 76 verbs, 193 adjectives, 16 adverbs, 44 pro-
nouns, and 10 contractions). The study of semantic meaning
helped discover many synonyms resulting from the spectrum
and diffusion of meaning of each word. 1277 words were
listed finally. Only the meanings of greatest importance
as indicated by actual usage were included in the list.
The learning load added for the student by semantics is indicated
by the fact that the basic list of nearly 1500 words becomes
25,000 when major meanings are considered.

221. Pfeffer, A. Basic (spoken) German idiom list. Englewood Cliffs, New
Jersey: Prentice-Hall, 1968.

This is the third in the Pfeffer series of studies on basic spoken
German. Idioms are restricted word patterns which are the sub-
stance of communications. They range from word pairs to whole
sentences. The meanings of some are self-evident. The meaning of
others is not. All are characterized by some form of interdep-
endence of parts and have some meaning different from their parts
taken separately. Idioms may be grouped as stylistic (proverbs and
commonplaces), linguistic (the degree of restriction of word col-
location), and syntactic (grammatical combinations or formulas).
Pfeffer defines what he means by an idiom and discusses prior
idiomatic lists based on 19th and 20th Century printed prose such
as Keniston (Spanish), Cheydleur (French) and Hauch (German). The
three above mentioned lists were subjective analyses. This Pfeffer
list is based directly on basic spoken German machine-counts using
595,000 punched and coded cards which enabled the determination of
unrestricted words and those in groups of restricted patterns.
This list also uses some phrases of the utility and empirical
words in the basic spoken list. Some 7500 oral patterns were
identified. These were reduced to 1026. An additional 99 (out
of some 1800) derived from spontaneous adult writing relative to
the utility and empirical words in the basic list were added. The
1026 idioms were restricted in usage to an average of 15 percent
of the time. However, 1125 (1026 and 99) represent about 85
percent of German oral idiomatic usage. Interestingly, the percen-
tage of the words in the idioms is greater than 15 percent of the
basic vocabulary. Also, the percentage frequency of the idioms
is high. The idioms listed have a frequency/range (f/r) index of
3/2 or greater. Idioms are recorded generally in groups of mutual
key words and arranged alphabetically; they also contain cross-
references to the component words. Such a list as this is indis-
pensable to teaching German. With the other two lists, there are
some 6,000 meanings and expressions which, if learned, will make
a student conversant with 85 percent of oral German.

222. Pfeffer, J. Basic (spoken) German word list. [HEW (Office of Educa-
tion) Contract SAE 8824 and OE2-14-036] Englewood Cliffs, N.J.: Pren-
tice-Hall, 1964.

In this work, Dr. Pfeffer was in close touch with the developers of
basic German (Advisory Research Council of the Institute of Basic
German) and the authors (Gougenheim and Rivenc) of Fundamental
French. In his introduction, Dr. Pfeffer contrasts subjective
and empirical approaches to word counts. He finds that the objec-
tive counts for use in reading vocabularies gave way in the 1950's
to the use of the phonograph and tape recorder to record and anal-
yze spoken language. He also includes an excellent resume of prior
works in the field of vocabulary counts. Pfeffer uses what he
considered to be the best aspects of word collation of the Spanish
word count prepared by the University of Puerto Rico and of Funda-
mental French. He used 595,000 running words based on 40 tape
recordings made in Germany, Austria, and Switzerland. Each tape
ran for 12 minutes. In addition, he obtained "utility" words
from 21 tapes on which he collected material from 2000 pupils.
These tapes yielded 420,000 verbs and adjectives as well as 420,000
nouns. Tapes were transcribed and each spoken word and word form,
adequately coded, was transferred singly and in context onto a
separate punch card. Excluded were proper names, place names,
and adjectives derived from them, as well as hesitations, repeti-
tions, and abandoned starts. The 595,000 cards yielded 25,000
lexical units. Range and frequency were computed. The frequency
used indicates the sum of the frequencies of inflectional forms.
The 1000 most common words (i.e., those with a frequency of 40
or more and a range of 25 or greater) were reduced by criteria
of applicability, universality, and indispensability to 737
spoken words. Topical or utility counts were made in 82 inter-
mediate and high schools in 48 cities in Germany, Switzerland,
and Austria. This was done by association with 20 nouns, 12
verbs, and eight adjectives in a ten-minute period. This yielded
833,000 terms including 19,700 nouns, 7,400 adjectives, and
6,000 verbs. Of these, 347 were finally selected for inclusion
in the list. Emphasis was placed on applicability as opposed
to topicality, which resulted in one-third of the words selected
having an order of rank of 200 or below. The 737 words were
combined with the 347 and rechecked for topicality limitations,
and then were augmented by 185 carefully selected words based
on direct or association sequences: words linking the specific
to the whole and vice versa, missing opposites, basic derivatives,
topical gaps (e.g., months, metals) and notions such as "deaf".
The total count numbered 1269. It is arranged first in alpha-
betical order indicating families, second by parts of speech,
and third in order of frequency and origin.
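
The first selection step described above, keeping words that reach both a
minimum frequency and a minimum range, can be sketched in a few lines. This
is a hypothetical illustration and not Pfeffer's punched-card procedure. It
assumes the transcripts are already tokenized and reduced to base forms, and
it takes range to mean the number of speakers who used a word, as defined in
entry 220. The function names and the toy data are invented for the example.

```python
from collections import Counter, defaultdict


def frequency_and_range(transcripts):
    """Return {word: (frequency, range)} for a set of speaker transcripts.

    `transcripts` maps a speaker id to that speaker's list of word tokens.
    Frequency is the total number of occurrences; range is the number of
    different speakers who used the word at least once.
    """
    freq = Counter()
    speakers = defaultdict(set)
    for speaker, tokens in transcripts.items():
        for token in tokens:
            freq[token] += 1
            speakers[token].add(speaker)
    return {word: (freq[word], len(speakers[word])) for word in freq}


def select_core(stats, min_freq=40, min_range=25):
    """Keep words meeting both thresholds (defaults follow the annotation)."""
    return sorted(
        word
        for word, (f, r) in stats.items()
        if f >= min_freq and r >= min_range
    )


# Toy data; the thresholds are lowered so the tiny sample selects something.
transcripts = {
    "speaker_1": ["haus", "gehen", "haus"],
    "speaker_2": ["haus", "kind"],
    "speaker_3": ["gehen", "haus"],
}
stats = frequency_and_range(transcripts)
print(stats["haus"])
print(select_core(stats, min_freq=2, min_range=2))
```

The later steps in the annotation (applicability, universality, and
indispensability) are judgments by the compiler and are not modeled here.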

223. Pfeffer, J. Alan Grunddeutsch, Basic (spoken) German word list.
Mittelstufe, Pittsburgh University Institute for Basic German, 1970.

"As a link between the words in everyday use and the sophisticated
language of the arts and sciences, the 1,536 words of the "Mittel-
stufe" or Level 2 derive in nearly equal proportion from three
sources: (1) the spoken or topical language, (2) a collation of
all significant word lists compiled prior to February 1965, and
(3) a statistical analysis of some 500,000 words in context pub-
lished or reprinted during the years immediately preceding. The
purpose of the list is to provide the lexical basis for teaching
German in the third and fourth year in high school or the second
year in college. Alphabetized word lists and appendixes indicating
frequency of usage are included. Extensive reference to source
materials is made according to topical listing."

224. Pimsleur, Paul Semantic frequency counts. Mechanical Translation.
1957, 1, 11-13.

A consideration of the problems involved in making semantic counts.

225. Plath, Warren Mathematical linguistics. Trends in European and Ameri-
can Linguistics, 1930-1960. eds. Christine Mohrmann, Alf Sommerfelt,
and Joshua Whatmough, Utrecht and Antwerp: 1961, 21-57.

A survey of the field with an extensive bibliography; see espec-
ially "Statistics of Style and Authorship", pages 27-30.

226. Polome, E. C. Swahili language handbook. Washington, D.C.: Center
for Applied Linguistics, 1967.

This book covers a great deal of information on Swahili. It presents
the phonetics and morphology of Swahili systematically in modern
terms. The section on phonetics was the most advanced to date
in 1967 (see review by Maw). It begins with an introduction which
covers the historical and geographical aspects of the language,
then goes on to sketch its structure, written language, con-
trasts with English, and literature. The language used is that
of a cultivated speaker of Zanzibar and of the Mrima coast.

227. Polome, E. C. Lubumbashi Swahili. Journal of African Languages.
Hertford, England: 1968, 2, (Part 1) 14-25.

This article focuses on the characteristics of the creolized variety
of Swahili spoken by individuals in Lubumbashi (Elizabethville,
Katanga Province, Republic of Zaire). Zaire Swahili is a
distinct variety in any event, but this article concentrates on
individuals who have no formal education in East Coast (Standard)
Swahili and who are residents of Lubumbashi. Lubumbashi Swahili is
most like the Zaire dialect of Swahili called Kingwana. It contains
many French loan words, some of which have changed meaning as
used locally. To a lesser extent, English words have also been
imported with workers from South Africa. There has also been an
influence of local native languages, including spelling changes as
well as phonetic shifts. Morphological changes have occurred and
possessives have been simplified. The syntax of Katangan Swahili has
not diverged much from East Coast patterns (similar to Central
Bantu, to which it is related), although some French patterns have
been superimposed on the original native patterns. In the lower
classes, some of the words of East Coast Swahili have been lost
and the remaining ones have been forced to take on multiple meanings
to maintain flexibility of expression. Changes are so great that
colloquial uneducated speech in Katangan Swahili would not be
understood on the East Coast, although that of the better educated
classes in Katanga would be understood on the coast, albeit with
some difficulty.

228. Posner, Rebecca The use and abuse of stylistic statistics. Archivum
Linguisticum. 1963, 111-139.

A critical survey of the field with comments on attribution prob-
lems, theoretical assumptions, sampling methods, and vocabulary
studies.

229. Pressman, A. Common usage dictionary (English-Russian and Russian-
English). The living language course, New York: Crown Publishing
Company, 1958.

This course follows the method of Ralph Weiman. It contains
15,000 basic items and 1,000 essential items. Unfortunately,
it does not state how the items were selected. It has glos-
saries of geographical and proper names.

230. Purin, L. A standard German vocabulary of 2932 words and 1500 idioms.
Boston: D. C. Heath and Company, 1937.

This book contains 2932 alphabetically arranged words, 2000 deri-
vatives, and 1500 idioms. It is for use in high schools, and in
elementary and intermediate courses in college German. It is
based in part on the Wadepuhl-Morgan (American Association of
Teachers of German) Dictionary, the New York State Basic German
Word List (1934), and the German Idiom List of C. D. Vial (SUNY
1933). The list includes 967 of the most frequently used words
given in the Wadepuhl-Morgan Dictionary. The first 500 words
in the Purin list include 400 found by Ortman as common to 12
of the word lists he examined. There are other words which
were added to previous word lists based on recommendations of
experts. They are the so-called "useful" words similar to
those used in Fundamental French and other similar compilations
of basic or first-stage (level) language vocabularies. Eng-
lish translations are provided for the German idioms. Semantic
meaning is explained by examples of the most common meanings
of each idiom. These examples are, in fact, sentence frames.
English cognates are also provided where appropriate.

231. Rapoport, Anatol The stochastic and the 'teleological' rationales of
certain distributions and the so-called principle of least effort.
Behavioral Science. 1957, 2, 147-161.

Criticism of Zipf's principle and interpretation of the Simon
and Mandelbrot derivations of the word frequency distribution
function.

232. Reed, David W. A statistical approach to quantitative linguistic
analysis. Word. 1949, 5, 235-247.

"The two elementary statistical devices presented are those
which may aid in answering the following questions in quan-
titative linguistic analysis: (1) How much evidence should
be collected in order to make a valid analysis of the fre-
quency of linguistic forms? (2) When may quantitative dif-
ferences in linguistic material be considered significant?"
The two devices are the "Standard Error of Proportion" and
the "Standard Error of Difference".
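
The two devices can be illustrated with the textbook formulas for
proportions. The sketch below is not drawn from Reed's article and may
differ from it in detail, for example in whether the two samples are pooled.
It assumes simple random samples and the usual large-sample approximation,
and the function names are hypothetical.

```python
import math


def se_proportion(count, n):
    """Standard error of a proportion p = count / n: sqrt(p (1 - p) / n)."""
    p = count / n
    return math.sqrt(p * (1 - p) / n)


def se_difference(count1, n1, count2, n2):
    """Standard error of the difference between two independent proportions
    (unpooled form: the two squared standard errors are added)."""
    return math.sqrt(
        se_proportion(count1, n1) ** 2 + se_proportion(count2, n2) ** 2
    )


def z_for_difference(count1, n1, count2, n2):
    """Observed difference expressed in units of its standard error."""
    diff = count1 / n1 - count2 / n2
    return diff / se_difference(count1, n1, count2, n2)


# A form found 60 times in 1000 running words of one text and 40 times in
# 1000 running words of another (invented figures).
print(se_proportion(60, 1000))
print(z_for_difference(60, 1000, 40, 1000))
```

Because the standard error shrinks with the square root of the sample size,
the first function also bears on question (1): halving the error requires
roughly four times as much text. A difference of about two standard errors
or more is conventionally treated as significant, which addresses
question (2).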

233. Richards, Jack C. A psycholinguistic measure of vocabulary selection.
Paper presented at the annual meeting of the Canadian Linguistic
Association, York University, Toronto, June, 1969, ERIC Accession
No. ED-035-8$0.

"Several basic problems in the field of the selection of voca-
bulary for teaching English as a foreign language are discussed.
The nature of word frequency and word availability are con-
sidered, along with their limitations as measures of the use-
fulness of concrete nouns. Word familiarity is proposed as a
psycholinguistic measure for noun selection, and some experi-
mental evidence presented to demonstrate its validity. This
is a preliminary report of a study which updates the 'general
service list' of Michael West through establishing word
familiarity figures for some 5000 nouns as well as updated
frequency figures for written and spoken English."

234. Roberts, A. Hood  A statistical linguistic analysis of American English. 
The Hague: 1965. 

The author presents "a quantitative analysis of the segmental 
phonemes of a speaker of a North Central US idiolect. With 
the aid of a digital computer, the 10,000-word corpus was 
analyzed with results that should help fill several needs in 
present-day linguistic study. Among these findings are the 
following: (1) the etymological composition of English according 
to proximate sources by thousands of frequency; (2) the canon- 
ical forms of the words in the language according to the classi- 
fication of the phonemes as vowel, consonant, semivowel and as 
to place and manner of articulation; (3) the frequency of occur- 
rence of the phonemes of the language; (4) the average word 
length in phonemes and in syllables by thousands of frequency; 
(5) the relationship between the alphabetic and phonemic systems 
of notation; (6) the frequencies of occurrence of initial, inter- 
vocalic and final consonants and consonant clusters; (7) the 
entropy of English determined by the relative frequencies of the 
phonemes in the corpus and by word length in phonemes and in 
syllables; (8) the transitional probabilities of phonemes; (9) 
the Standard Error of a Proportion, the Standard Error of Dif- 
ference between the two proportions, and the Standard Error 
Deviation for consonants and vowels separately and together." 
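
Finding (7) refers to entropy computed from relative frequencies. As a rough illustration, the first-order (Shannon) entropy of a symbol inventory can be sketched as follows; this is a generic calculation and not Roberts's program, and the toy symbol string is hypothetical.

```python
import math
from collections import Counter

def entropy_bits(symbols):
    """First-order entropy, in bits per symbol, from relative frequencies."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical "phoneme" sequence; each character stands for one phoneme.
sample = list("abracadabra")
h = entropy_bits(sample)
```

The same function applies to any unit (letters, phonemes, word lengths) once the corpus is reduced to a sequence of symbols.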

235. Robinson, W. P.  Cloze procedure as a technique for the investigation 
of social class differences in language usage. Language and Speech, 
1965, 8, 42-55. 

"Cloze procedure was used to investigate the nature and extent 
of the differences in verbal behavior of working and middle 
class boys. Words were deleted in sentences taken from 'formal' 
and 'informal', middle and working class letters and from middle 
and working class oral utterances. The results showed that the 
middle class boys used a wider range of words and preferred dif- 
ferent words in this situation. The working class boys showed 
more conformity in their responses than the middle class boys, 
especially for the written materials. Fruitful lines for fur- 
ther research on 'restricted' and 'elaborated' codes are discussed." 



236. Rodriguez-Bou, I.  Recuento de vocabulario Espanol (Spanish vocabulary 
count). Rio Piedras, Puerto Rico: University of Puerto Rico Press, 
1952. 

The word count was encouraged and aided by the Organization of 
American States and UNESCO (program for Fundamental Education). 
It is a list of words in the Spanish language in accordance with 
the frequency of usage. It refers to Buchanan's Graded Spanish 
Word Book of 1929 which had 1,200,000 words. This word book 
considered more than 7,000,000 running words, covering both 
written and spoken Spanish of children and adults. It consists 
of Volume 1 and 2, Parts 1 and 2. Part 1's sources were news- 
papers and magazines, radio programs, religious works, and 
scholarly texts. It gives frequencies of lexical units of each 
source separately, based on 1,000,000 units. It includes fre- 
quencies of inflectional forms as well as of the head word. 
The introduction gives a fine history of word counts and their 
importance. The first list is of the 10,000 lexical units most 
frequently used in order of rank with separations at each 500 
through the first 5,000. The second list gives the /i,000 most 
frequent inflectional forms in order of frequency rank. The 
third list gives the first 10,000 lexical units (same as list 
I), in alphabetical order but with the frequency and frequency 
rank indicated. The fourth list gives the 20,000 word inflec- 
tional list (list 2) in alphabetical order with frequency and 
frequency of less than 16, listed in alphabetical order. Appen- 
dix A provides the methods, techniques and procedures used in 
compiling the frequency counts. Words counted include separately 
all the variations in the form of each word, including idioms. 
However, different semantic meanings of a word were not included, 
which keeps an otherwise excellent word count in an incomplete 
form. All words were included except: unintelligible ones, in- 
vented ones, words without meaning, and some peculiarities of 
speaking or writing of children. Neologisms and regionalisms 
were included but not marked as such, if they were accepted by 
a panel of experts. Other words in current spoken or written 
Spanish, but not in dictionaries, were accepted if they appeared 
to be derived according to the laws of composition and derivation 
of words and were used by a good number of educated people. It 
was found that the first 105 words accounted for 50 percent of 
all words used. Data sources which were used were: oral voca- 
bulary, associations, written compositions, and the count of 
Rodriquez and Casanova (University of Puerto Rico). 
Oral: School children were placed in various situations at school 
and asked to write on the subject or discuss it orally in front of 
a tape recorder. From grades 1 to 6, some 1,073,2'i5 running words 
were compiled. 

Association (Controlled): Groups of nine words were used up to a 
total of 10 groups (90 words). Children wrote words which occurred 
to them from words of the nine in a group. 

Association (Free): Students wrote all the words occurring to them 
in five minutes from association. Some 926, 'lO A running words were 
obtained. 

Written Composition: The procedure followed was that of Rinsland. 
The sample was grades 2 to 6 of public schools (803,622 running 
words). To these were added 5B6,Ul compiled by Rodriquez and 
Casanova in a similar collection exercise. 

Recognition Vocabulary: Newspapers from January through June, 1947, 
some 91 editions on alternate dates (1,050,000 running words). 

Radio programs of various types (465,600 running words). 

Religious (all types in Puerto Rico). 

Buchanan's Graded Spanish Word Book (1,300,000 words). 

Basic Educational Texts for Elementary Schools. 

Books of supplemental reading for elementary schools. 

Appendix B - The first, most used, 105 words. 

Appendix C - Book references. 

Appendix D - Procedures for obtaining compositions for children. 
Appendix E - Free association stimulation words. 
Appendix F - Buchanan's 40 categories of written materials. 
Appendix G - Directions for arranging order in lists of lexical 
units and inflectional lists. 

Volume 2, Part 1 starts with a summary of the introduction and 
history of word counts. It gives the 20,5^2 lexical units of 
the count with their inflectional forms, as line items. Under 
14 columnar headings are listed the frequencies of appearance 
of the lexical units by source of the words (A-H). Part 2 con- 
tinues Part 1 for words beginning with I-Z. It also contains 
Appendices A-G as in Volume 1. 
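
The coverage figure reported in this entry (the first 105 words accounting for 50 percent of all words used) is a cumulative-coverage statistic. A minimal sketch of how such a figure can be derived from any token list is given below; the function name and the toy Spanish tokens are hypothetical and are not taken from the Rodriguez-Bou count.

```python
from collections import Counter

def words_needed_for_coverage(tokens, target=0.5):
    """Number of top-ranked word types whose summed frequency reaches
    the target share of all running words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)

# Hypothetical ten-token sample.
tokens = "el la de el que la el de y el".split()
half = words_needed_for_coverage(tokens, 0.5)
```

Applied to a count of several million running words, the same loop yields the kind of "first N words cover X percent" statement quoted in the annotation.
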

237. Rose-Innes, A.  Fundamental spoken Japanese. (Revised and enlarged 
by W. Kos, S. J. and also known as A new 3rd edition of conversational 
Japanese for beginners). Tokyo: Meiseisha Publishing Company, 1967. 

Part 1 is a graduated exercise in conversational Japanese developed 
by Japanese in Japanese. An English translation is given for each. 
Part 2 is elementary grammar of spoken Japanese. Emphasis is 
placed on the sentence, not on the words. Part 3 is an explanatory 
vocabulary of common Japanese words. This is an update of an 
earlier publication by the same author which printed the vocabulary 
separately. 

238. Russo, G. A.  A combined Italian word list. Modern Language Journal, 
1947, 31(4), 218-240. 

"This list contains 3,173 Italian words with indications of 
their relative difficulty. The collection was compiled from 
two earlier selections: (1) the 'Knease List' of some 400,000 
running words based on 40 Italian literary works published in 
Italy and scored according to range and frequency, and (2) the 
'Skinner List' constructed according to range only from the 
Italian-English vocabularies of ^5 Italian textbooks published 
in the United States." 

239. Rutherford, R. W. and H. Wears, ed.  Enquete sur le langage de 
l'enfant Francais (Investigation of the language of French children): 
The spoken language of nine-year-old French children. The Nuffield 
Foundation (Leeds, England), 1969. 

"Transcriptions of recorded conversations of nine-year-old French 
children are analyzed and presented in this comparative word 
count. The actual count of the 55,!>88 word corpus is arranged 
alphabetically and contrasted with selected, identical words 
found in the Francais Fondamental word list. Proper nouns are 
listed separately at the end of the regular count, and gram- 
matical functions of items are mentioned when a word appears. 
The count lists word frequency, word totals, and the Francais 
Fondamental word count. Discussion of classification, column 
labels, and notes on child language are included." 

240. Rosnan, E.  A pilot project on vocabulary selection for foreign students. 
New York: Teachers' College, Columbia University, November 1962. 

This work describes a method of making word frequency counts for 
Afghan students at a US university. A set of 30,000 cards of words 
and phrases is available at the International Center of Teaching 
Materials at Teachers' College. It emphasizes reading rather than 
oral vocabulary. It includes a 3 ♦'♦36 word count and a list of 
103 of them with the greatest frequency. 

241. Sacleux, C.  Dictionnaire Francais-Swahili (French-Swahili dictionary). 
(2nd edition revised and augmented) Paris: Institute of Ethnology, 
1959. 

Sacleux was a missionary who worked principally in Zanzibar. Un- 
fortunately, the dictionary does not give any indication of the 
techniques of word selection or other procedures used in its pre- 
paration. The previous edition (1939) had 1115 pages (but printed 
in different font). There is, however, an introduction which 
indicates the geographical distribution of Swahili, the alphabet 
used in the dictionary, the Swahili dialects, and notes that the 
Zanzibar dialect (Ki'Ungudya) was the principal basis for the 
dictionary and had preference in its development. The dictionary 
also contains a two-page bibliography. 

242. Savard, J. G.  Analytical bibliography of language tests (Bibliographie 
analytique de tests de langue). Quebec: Les Presses de l'Universite 
Laval, 1969 (bilingual). 

This recent book is divided into an introduction and seven parts. 
The first five parts are subdivided into an index of titles, index 
of authors, and analytical. The seven parts are: Second Language 
(150 tests), Mother Tongue (National Language-150 tests), Bilin- 
gual Tests (English-Spanish and English-French), Aptitude Tests, 
Psychological Tests, Miscellaneous (not in the first five parts), 
and Index of Publishers. Parts 1, 4, and 5 are applicable to 
student selection and training. 

243. Savard, J. G.  La valence lexicale (word coverage). Paris: Didier, 
1970 (in French). 

The purpose of the book is to develop an alternative to frequency 
of occurrence as an objective measure by which to select words 
for teaching vocabularies, especially for beginners in a second 
language. Once developed, it could also be used to refine 
existing vocabularies. Source documents were Fundamental French, 
1st and 2nd Levels, dictionaries of synonyms, and analogical dic- 
tionaries. The proposed alternative to frequency in establishing 
vocabularies is valence. The criteria for valence are stated as 
definition (explaining power, or how often the words can be used 
to explain or define others; in other words the basic nature of 
the word), inclusion (the number of words for which it is a synonym 
and for which it may be substituted), combination (combining power, 
its use in compound words and idioms), and extension (semantic 
range or power; the quality of having more than one partially or 
completely different meaning). In the analysis each criterion was 
given equal weight and its computed value was added to the other 
three to arrive at the total valence of the word in question. 
The text describes the procedures used and is plentifully supplied 
with appendices (^), tables (16), and figures (6). Of interest 
is the fact that the basic French vocabulary so derived has 897 
words which in number approximates the 850 of basic English. How- 
ever, when valence is computed, there appears to be no correlation 
between words rating high on valence and those on frequency so 
more research is required to determine the best basis for vocabu- 
lary construction. 
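
The equal-weight summation described in this annotation can be sketched as follows. Only the four criterion names and the additive rule come from the annotation; the class name, the scoring scale, and the example words and values are hypothetical, and Savard's actual scoring procedure for each criterion is not reproduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Valence:
    """Scores on the four criteria named in the annotation."""
    definition: float
    inclusion: float
    combination: float
    extension: float

    def total(self) -> float:
        # Each criterion carries equal weight; the total is a plain sum.
        return self.definition + self.inclusion + self.combination + self.extension

def rank_by_valence(scores):
    """Return the words ordered from highest to lowest total valence."""
    return sorted(scores, key=lambda w: scores[w].total(), reverse=True)

# Hypothetical scores on an arbitrary scale.
scores = {
    "faire": Valence(3.0, 2.0, 4.0, 3.0),
    "chose": Valence(2.0, 3.0, 1.0, 2.0),
    "pomme": Valence(0.5, 0.0, 1.0, 0.5),
}
ranking = rank_by_valence(scores)
```

Comparing such a ranking with a frequency ranking of the same words is the kind of check behind the annotation's remark that valence and frequency did not appear to correlate.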

244. Scholes, Robert J.  On functors and contentives in children's imitations 
of word strings. Journal of Verbal Learning and Verbal Behavior, 1970, 
9, 167-169. 

"Young children (mean age 3 years, 11 months) were asked to repeat 
word strings presented from tape. The strings varied in length 
from three to five words; in sentencehood in that some were well- 
formed sentences, some were anomalous, and some were syntactically 
deviant; and in word types in that some strings contained all real 
words, some contained real function words plus nonsense items, 
and some contained real content words plus nonsense items. The 
results of this experimentation suggest that children's differential 
imitation of contentives and functors is accounted for by an 
'identify and retain contentives' strategy and that the principal 
criteria for the classification 'contentive' are phonological form 
and semantic function." 

245. Schonell, Fred J., Meddleton, Ivor G., and Shaw, B. A.  A study of the 
oral vocabulary of adults: an investigation into the spoken vocabulary 
of the Australian worker. Brisbane: University of Queensland Press, 
1956, and London: University of London Press, 1957. 

The work contains a brief account of the history of word frequency 
studies (pages 10-27) along with a more particular discussion related 
to investigating vocabulary size. Spoken samples of the speech of 
Australian workers were taken by interviews, recordings on the job 
and in public places. The author discusses utility yields of methods 
and correspondences of the hand tabulation of a total of tokens 
accounted for by 85 words, all of which were words "...used to pro- 
mote the flow of speech rather than to inject meaning into what is 
said." 

246. Seashore, Robert H., and Eckerson, Lois D.  The measurement of indiv- 
idual differences in general English vocabularies. Journal of Educa- 
tional Psychology, 1940, 31, 14-38. 

This study presents a detailed account of work aimed at discovering 
the size of vocabularies by means of multiple-choice tests derived 
from dictionary entries. 

247. Sebeok, Thomas A. (ed.)  Style in language. Cambridge, Mass., New York, 
and London: MIT Press, 1960. 

A collection of papers presented at a conference on stylistics 
held at Indiana University in 1958, by literary critics, linguists, 
psychologists, and cultural anthropologists working in stylistics. 

248. Shannon, C. E. Prediction and entropy of printed English. Bell System 

Technical Journal , 1951, 30, 50-64. 

"A new method of estimating the entropy and redundancy of a lan- 
guage is described. The method exploits the knowledge of the lan- 
guage statistics possessed by those who speak the language, and 
depends on experimental results in prediction of the next letter 
when the preceding text is known. Results of experiments in pre- 
diction are given, and some properties of the ideal predictor 
are developed." 

249. Shapiro, B. J.  The subjective scaling of relative word frequency. 
Doctoral Thesis, Harvard Graduate School of Education, Ann Arbor, 
Michigan: University Microfilms, 1972. 

This study explores subjective scaling of word frequencies and its 
relationship to objective scaling. If a relationship could be 
established, the enormous samples required in objective scaling 
studies could be avoided by using subjective scaling and converting 
it mathematically to the equivalent objective scaling result. The 
study explores the subjective scaling of relative word frequency 
of English in relation to the Fechnerian (logarithmic) and Stevens 
(power) psychophysical theories, various informant populations, 
and both the written and the spoken language. The Thorndike- 
Lorge 30,000 word Teacher's Word Book (1944) and the Francis and 
Kucera word count of 1965 were used as the objective criterion 
measurements. Eighty-eight word-stimuli for the subjective 
scaling were selected from these sources to cover frequencies 
ranging from .2 to 68,000 per million. 18'i informants were se- 
lected from as varying populations as sixth and ninth graders, 
college sophomores, and adults in as widely distributed occupa- 
tions as industrial chemistry, elementary school teaching, and 
newspaper reporting. Two subjective scaling methods were used: 
multiple rank order and magnitude estimation. In addition, half 
of each informant group responded in terms of written language 
and half in terms of spoken language. The author concludes that 
the magnitude estimation technique tended to follow the Stevens 
(power law) model and the multiple rank order technique was 
closely related to the Fechnerian (logarithmic law) model. Chi 
Square tests showed, however, that the observed multiple rank 
order data did not fulfill all the assumptions of its analytic 
technique. In addition, the magnitude estimation and multiple 
rank order techniques were logarithmically rather than linearly 
related. Shapiro also found that relative word frequencies are 
a prothetic (how much?) psychological-additive variable rather 
than a metathetic (what kind or where?) substitutive variable. 
He further states that the same observations apply to other 
linguistic units, such as syllables, grammatical constructions, 
and letters. He concludes that they are best measured subjec- 
tively by the magnitude estimation technique, but that there is 
a need for additional studies to verify his findings. The study 
is profusely illustrated by tables and figures and contains 
samples of forms used in sample collection from informants. In 
addition, the procedures used are amply detailed. This is an 
important contribution to ongoing work on alternative methods 
of obtaining word frequencies without the voluminous sampling 
used in the objective or direct counts. 

250. Sherwood, John, and Morton, Iver  Phoneme frequencies in Australian 
English; a regional study. Journal of the Australasian Universities 
Language and Literature Association, 1966, 26, 272-302. 

This study involved transcribing 16,800 running words from the 
speech of twenty-two adults and thirteen children according to 
the phonetic scheme developed by Professor A. G. Mitchell for 
Australian English. The statistical results are compared with 
other published counts. The work is part of a larger study of 
regional variation in Australian English. 

251. Siliakus, H., and Morris, K.  Some reflections on the lack of accuracy 
of word frequency lists. Review of the ITL (Institute of Applied 
Linguistics, Louvain, Belgium), 1970, 2> ll~}3. 

The article discusses the ubiquitous problem of error in fre- 
quency counts due to low occurrence of important items in word 
counts taken from relatively small lengths of running text, and 
illustrates with examples. They reinforce the argument for longer 
samples of running text and shorter frequency lists, i.e., thematic 
or topical. 

252. Simon, Herbert A.  On a class of skew distribution functions. Biometrika, 
1955, 42, 425-440. 

This article considers a variety of such functions; in discussing 
empirical distributions, he examines the distribution of word 
frequencies. 

253. Skinner, B. F. The distribution of associated words. Psychological 

Record, 1937, 1, 71-76. 

Skinner shows that the rank-frequency relation described by Zipf 
applies to "samples of speech selected on a semantic basis". He 
illustrates his thesis by an examination of free association re- 
sponses. 

254. Society for International Cultural Relations (Japan)  Japanese basic 
vocabulary. Tokyo: Kokusai Bunka Shinkokai (KBS), 19^*1. 

This is a 2000 basic word list. The class and inflections of each 
word are given, and the several meanings of each word are carefully 
explained. With its examples of compounds and synonyms, this is 
almost a basic dictionary. The words were selected by a committee 
based on subjective criteria. One of the purposes of this book 
was to assist in teaching Japanese to a foreigner. 

255. Society for International Cultural Relations (Japan)  KBS bibliography 
of standard references for Japanese studies with descriptive notes. 
Tokyo: Kokusai Bunka Shinkokai (KBS), 1961, j^. 

This bibliography is similar to, but extends further back into 
history than the one by Yamagiwa. It has 16 chapters of which 
the following are especially important: bibliography, dictionaries, 
phonetics and phonology, grammar, and special languages and 
lexicology. 

256. Somers, M. H.  Analyse mathematique du langage-lois generales et 
mesures statistiques. Louvain: 1959. 

In this book after examining the formulas of Zipf, Mandelbrot, 
and Simon, the author proposes a lognormal distribution for voca- 
bulary and applies it to texts for which counts have been published. 
He uses it in estimating vocabulary, text-length ratios, word- 
length distributions, etc. A formula for type-token ratios is 
given. 

257. Spolsky, Bernard et al.  A spoken word count of six-year-old Navajo 
children. New Mexico University, 1971. 

"As part of a study of the feasibility and effect of teaching 
Navajo children to read their own language first, a word count 
collected by 22 Navajo adults interviewing over 200 Navajo 
6-year-olds was undertaken. This report discusses the word 
count and the interview texts in terms of (1) number of sentences, 
(2) number of words, (3) number of tokens, (4) type-token ratios, 
and (5) word-length. A frequency list gives all words used by 
at least 2 children. The words, mostly Navajo, are grouped in 
order of frequency and in alphabetical order within each frequency. 
A supplement lists, alphabetically, all words from the interview 
texts (whether used by children or adults). Frequency and range 
data for adults and children are given separately and in total 
for each word." 
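
Several measures named in this abstract (type-token ratio, frequency, range, and a list restricted to words used by at least two children) can be illustrated with a minimal sketch. The function names and the toy data are hypothetical; nothing here reproduces the procedures or data of the Navajo study.

```python
from collections import Counter

def type_token_ratio(tokens):
    """Distinct word types divided by running words (tokens)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def frequency_and_range(texts_by_speaker):
    """Frequency = total occurrences; range = number of speakers using the word."""
    freq, rng = Counter(), Counter()
    for tokens in texts_by_speaker.values():
        freq.update(tokens)
        rng.update(set(tokens))
    return freq, rng

def words_with_min_range(texts_by_speaker, min_speakers=2):
    """Words used by at least min_speakers speakers, ordered by descending
    frequency and alphabetically within each frequency."""
    freq, rng = frequency_and_range(texts_by_speaker)
    kept = [(w, freq[w]) for w in freq if rng[w] >= min_speakers]
    return sorted(kept, key=lambda item: (-item[1], item[0]))

# Hypothetical interview texts, one token list per child.
texts = {"c1": ["a", "b", "a"], "c2": ["a", "c"], "c3": ["b"]}
listing = words_with_min_range(texts)
```
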

258. Stone, P., et al.  User's manual for the General Inquirer. Cambridge, 
Mass.: MIT Press, 1968. 

This is a companion volume to "The General Inquirer: A Computer 
Approach to Content Analysis" (MIT Press, 1966). The purpose of 
this book is to provide technical specification of the computer 
programs in the General Inquirer content analysis system and 
detailed instructions for using these programs. The manual is 
divided into four parts of 3-6 chapters each and a series of 
appendices. Part 1 is an introduction to the system and speci- 
fications for preparing dictionary and text data. Part 2 discusses 
primary programs for assigning tags to text, and performing oper- 
ations of listing, counting, and retrieval on tagged text. Part 3 
relates secondary programs for processing tag scores. Part 4 gives 
two types of secondary programs to facilitate the development of 
dictionary categories: one for generating a key-word-in-context 
index of a sample of text, and the other for displaying a diction- 
ary in a special "cross-sorted" format. 

259. Swenson, E., and West, M.  On the counting of new words in textbooks 
for teaching of foreign languages. (Bulletin No. 1, Department of 
Educational Research) Toronto, Canada: University of Toronto Press, 

This short study is an excellent background on the subject of 
word counting. It is in two parts: on counting of new words 
and the history and purpose of word counting and analyses its 
procedures (as of 1933-1934). Of special interest are the 
chapters on the origin and counting of a speaking vocabulary. 
Part 2 on rating scales provides specific methods for rating 
the difficulty of learning meaning, idioms, cognates, compounds, 
and spelling-pronunciation discrepancies (i.e., the words 
that do not sound the way they look or vice versa). The study 
was undertaken to eliminate some of the confusion of the tech- 
niques of word counting and vocabulary control. It discusses 
Palmer's work in the field, reasons for word counts, differences 
between written and spoken counts, methods of rating words, 
word rating scales, and testing the scales for reliability 
and intelligibility. 

260. Swenson, Rodney  A frequency count of contemporary German vocabulary 
based on three current leading newspapers. Dissertation Abstracts, 
1967, I0i 222A-23A (Also final report of the Director on a frequency 
count of contemporary German vocabulary based on three current leading 
newspapers. A project of the US Office of Education in cooperation 
with Hamline University, St. Paul, and the University of Minnesota, 
Minneapolis. Washington, D.C., Department of HEW, USOE. 1967.) 

The foreword discusses the lack of validity of earlier word counts 
because of changes in language and changes in the goals and objec- 
tives of teaching foreign languages. The goal now is oral, which 
leads to the oral vocabulary needed to learn the most frequently 
used words in current speech. The newspapers selected for the 
study were: Die Welt in Hamburg (103,8^0 words), Suddeutsche Zeitung 
in Munich (232,280 words), and Frankfurter Allgemeine Zeitung in 
Frankfurt a/M (167,700 words). The papers were selected since they 
all had a fairly well-educated reading audience with wide West Ger- 
man geographic spread. Each had a circulation of 210,000 to 
250,000. Since in Germany two-thirds of the population read news- 
papers, newspapers were selected for the sampling as they were 
considered to contain words in general usage. As noted above, 
the sampling did not include East Germany, Austria, or Switzerland. 
The sampling took place from 1 October 1964 through 31 January 
1965 (four months). 504,000 running words of an estimated 12 
million were tabulated. Samples were taken from columns at 
least six inches in length. Every fifth column was selected 
and the first 120 words counted and tabulated. Advertisements, 
want ads, headlines, and picture captions and supplements were 
not used. Proper nouns, geographic names and abbreviations were 
also excluded from the count but were tabulated and filed sepa- 
rately. Word forms were tabulated under the root form or infin- 
itive. Tabulations followed Pfeffer's Basic Spoken German. The 
final lists are of the first 500, 1000, and 1500 words of the 
German language as reflected in newspapers. The tabulations 
indicate the count by newspaper and by total. The conclusion 
is that with the 1500 words in the longest list, one could under- 
stand a considerable amount of contemporary German. The list 
does not include grammatical forms and is religious, sports, 
political, literary, and science oriented. Frequencies indicate 
changes are required in the sequence of instruction, especially 
in verbs, since forms used now are not being taught first. 
Articles (grammar) have about the same frequency now as in the 
counts of the 1920's. High frequency words in one paper were 
generally high in the others. This was not so true of low 
frequency words (probably reflecting newspaper policy or regional 
difference). There are two major lists: numerical frequency (one 
for each newspaper and a total) and alphabetical (for the total 
count only). 
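
The sampling rule described in this annotation (columns of at least six inches, every fifth column, first 120 words) can be sketched roughly as below. The data layout and function name are hypothetical, and the sketch assumes that "every fifth column" is counted among the eligible columns only, which the annotation does not state.

```python
def sample_columns(columns, min_inches=6.0, step=5, words_per_column=120):
    """Return the first words_per_column words of every step-th eligible column.

    columns: list of dicts with keys "inches" (column length) and "text".
    """
    # Assumption: columns shorter than min_inches are dropped before counting.
    eligible = [c for c in columns if c["inches"] >= min_inches]
    sampled = eligible[step - 1::step]
    return [c["text"].split()[:words_per_column] for c in sampled]

# Hypothetical issue: ten long columns of 200 words each and one short column.
long_text = " ".join(f"w{i}" for i in range(200))
columns = [{"inches": 7.0, "text": long_text} for _ in range(10)]
columns.insert(3, {"inches": 2.0, "text": "kurz"})
samples = sample_columns(columns)
```

The exclusions listed in the annotation (advertisements, headlines, captions, proper nouns) would have to be applied before or after this step and are not modeled here.
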

261. Tadashi, Kikuoka  The 1000 most important Japanese newspaper charac- 
ter compounds in order of descending frequency. South Orange, N. J.: 
Institute for Far Eastern Affairs, Seton Hall University, 1965. 

The sources are: A glossary of journalistic terms. Tokyo: 
Nihon Simbun Kyokai, 1961, and the Dictionary of journalistic 
terms. Tokyo: Asahi Shimbun-sha, 1951. The original selection 
of compounds was made by Dr. lUrosha Okubc at H-isei University, 
Tokyo. It was updated by the present author. 

262. Tannenbaum, Percy H., and Williams, F.  Prompted word replacement in 
active and passive sentences. Language and Speech, 1968, 11, 220-229. 

"A conceptual focus formulation developed in a prior study of 
encoding of active and passive sentences led to predictions con- 
cerning how such sentences are stored and their main semantic 
units are retrieved from memory. The formulation posed a dom- 
inant subject-verb linkage in active sentences but an object- 
verb linkage characterizing passive sentences. Individuals were 
presented with either the subject, or verb, or object of pre- 
vious exposed sentences and were required to replace the other 
two missing words. As anticipated, the subject was a better 
prompt for the verb in active than in passive sentences but the 
reverse relationship obtained when the object was the cue. 
Similar predictions for situations when the verb was the prompt 
word were supported when the subject was the response but not 
when the objects were to be replaced." 

263. Tarnoczi, Lorant.  Wortbestand, Wortschatz, Wortfrequenz. IRAL, 1971, 
9, 297-310. 

Written in German, this article criticizes the manner in which 
minimal lexical inventories are made and questions the claims 
made for such lists. He refutes the claim that a minimal voca- 
bulary of 1000 to 2000 words will enable one to understand 75 
percent of a text in a given language, particularly since the 
texts used are theme oriented and this fact is not taken suffi- 
ciently into consideration. The author proposes that minimal 
vocabularies based on frequency counts must distinguish be- 
tween basic vocabulary and thematic vocabulary and that a min- 
imal vocabulary must vary according to the teaching objectives. 






264. Taylor, G.  Learning American English. New York: Saxon Press, 195^. 

This book is planned to meet the needs of adult students at the 
beginning or intermediate stages of learning English as a second 
spoken language. The English described is a basic, informal, 
spoken language, used by the majority of US citizens. There are 
17 lessons each followed by exercises. There are over 1500 English 
words used, but the author suggests concentrating on the first 
550. Word lists are derived from Thorndike and Lorge's 30,000 
Word Teachers' Word Book and the KLM List of Bongers' 
History and Principles of Vocabulary Control (1947). 

265. Thomson, Godfrey H., and Thompson, J. Ridley  Outlines of a method for 
the quantitative analysis of writing vocabularies. British Journal of 
Psychology, 1915, 8, 52-69. 

The question addressed by the study is: "How can we find a measure 
to enable us to estimate the total vocabulary from the study of a 
sample?" (page '^S). The method they develop consists of assigning 
weights to particular words according to the number of times they 
appear in the sample; the proposed formula is then used to estimate 
the author's total vocabulary. The technique is tested through 
a study of chapter fifty-five of David Copperfield. 

266. Thorndike, Edward L.  On the number of words of any given frequency 
of use. The Psychological Record, 1937, 1, 399-406. 

This is a discussion and criticism of Zipf's rank-frequency hypo- 
thesis; the essay includes data from a 4.5 million word sample of 
the language of children's books. 

267. Thorndike, Edward L. and Lorge, I. The teachers' word book of 30,000
words. New York: Teachers' College, Columbia University, 1959.

This publication consists of five parts and is the last of a
series of word lists published by Thorndike with or without
co-authors, e.g., the 20,000 word lists. Part 1 is a list of
words occurring at least once per 1 million words. Part 2
is a list of words occurring at least once per 4 million words.
Part 3 is an explanation of Parts 1, 2, 4, and 5. Part 4
includes the number of occurrences of words occurring 1000 or
more times in either of the counts (Lorge Magazine Count and
Lorge-Thorndike Semantic Count). Part 5 is a list of the 500
words occurring most frequently and of the 500 occurring next
most frequently.

268. Uhlirova, L. On statistical experimenting in syntax. Statistical Methods
of Linguistics. 1969, 5, 13-33.

"The first stage of statistical research on Czech word order
in relation to syntax and to the so-called topic-comment
bipartition of the sentence. A necessary prerequisite of
this research is the determination of whether or not the
linguistic material in question is representative from the
statistical point of view. Hence, the objective is to establish
by statistical experimentation whether or not a certain corpus
is representative (sufficient) for the purpose of a syntactical
word-order analysis. Samples of 1000 successive clauses were
selected arbitrarily from texts of different genres and analyzed
syntactically in accordance with Czech authoritative grammar.
Three classes of clauses were differentiated: clause simple
sentence (J), sentence (V). Tables are included which indicate
the cumulative distribution of clause length across clause types
and by clause type, frequency of occurrence of each clause type,
frequencies of syntactic categories in clauses of varying lengths,
as well as graphs representing these data."

269. University of Michigan (English Language Institute) Selected articles
from Language Learning. (Series 1. English as a foreign language)
Ann Arbor, Mich.: Language Learning Reprints, 1953.

This is a selected group of articles by prominent scholars of
English. The articles were picked to emphasize the "new" (1953)
approach in structural linguistics from items on word and sound
and from the system of contrastive patterns in which the items
operate. There are six parts to the collection, each with from
3 to 13 articles. They include language learning, language
teaching, grammar, pronunciation, vocabulary, and testing.

270. University of Michigan (English Language Institute) Teaching and
learning English as a foreign language. Ann Arbor, Mich.: University of
Michigan Press, 1962.

This work is a basic explanation of English as a foreign language.
It includes an introduction on adult learning, sounds, structure
arrangement and form, words and vocabulary, and contextual
orientation.




271. Vakar, Nicholas P. A word count of spoken Russian: the Soviet usage.
Ohio State University Press, 1966. (This is the final report on US
Office of Education HEW Contract CEC-3-6-0620A6, July 1969)

There are two parts (volumes) to the report. Part 1, vocabulary
(normal colloquial vocabulary): obtaining samples of colloquial
Soviet Russian speech is difficult. Professor Vakar notes that
since 1956, however, Soviet drama has come to deal with everyday
problems and has been presented in the language of the audience.
As a result, Professor Vakar based his study on an actual count
of 10,000 words from 50-word samples taken from 200 acts of 93
plays published since 1957. With the small sample, Professor
Vakar assumed that the most common words occur in virtually every
conversation of any length, so the sample need not be large. In
fact, he found that 360 words of a total of 2360 words in the
10,000 word sample represented 73 percent of all occurrences and
are satisfactory for intelligent adult oral communication. Also
of note are some 75 word-clusters which indicate the cumulative
frequency of occurrence of certain stems or roots. Part 2, sentence
structure (colloquial in Soviet usage): there is a tremendous
difference between literary and spoken Russian. A set of a few
hundred common words, grammar fundamentals and favorite turns of
phrase constitutes the core of ordinary conversation. Chapter 2
gives the basis for the sentence sample. It also used monologues
and dialogues of 93 plays written from 1956-1964, representing a
statistical universe of 1 million running words. From these, 1000
sentences were randomly selected. Sentences tend to be short, 1
to 5 words (75 percent). Also included is a glossary of nouns,
verbs, adjectives, and adverbs as well as four appendices.
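
The coverage figure quoted in this annotation (360 of 2360 words accounting for 73 percent of all occurrences) is a cumulative-frequency calculation. A minimal sketch of that kind of calculation follows; the function name `coverage` and the toy sample are illustrative assumptions and are not taken from Vakar's report or data.

```python
from collections import Counter


def coverage(tokens, top_n):
    """Share of all running words accounted for by the top_n most frequent word types."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum the frequencies of the top_n most frequent word types.
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / total


# Hypothetical toy sample of 10 running words (not Vakar's data).
sample = "da ja ne da i da ne ja dom da".split()
print(coverage(sample, 2))  # 0.6: the two most frequent types cover 6 of 10 tokens
```

Applied to a real count, `top_n` would be 360 and the token list the full 10,000-word sample.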

272. Vakar, Nicholas P. Statistical methods in the analysis of Russian.
Slavic and East European Journal. 1967, 11, 59-65.

Vakar here contrasts two counts, one based on a word count of
contemporary plays and the other on a word count of actual
conversations. He then discusses the implications of such counts for
language teaching.

273. Van den Eynde, R. Grammaire Swahili (Swahili grammar). Brussels:
Waithhoz-Legrande, 19^^.

This book is in French. It does contain a vocabulary as well as
the grammar suggested by the title. This book was originally
intended for students who would later spend some time in the (then)
Belgian Congo. Although the Swahili dialect spoken in the
Republic of Zaire is a poor one found largely in Katanga (Kingwana),
the author has addressed himself to "pure" or standard Swahili of
the east coast of Africa. The grammar part takes up the first
88 pages with nine chapters. It is followed by the French-Swahili
vocabulary. The alphabetized French-Swahili vocabulary is
followed by a special vocabulary on units of measurement, days of
the week, and a Swahili-French vocabulary.

274. Vander Beke, George E. French Word Book. (American and Canadian
Committees on Modern Languages) New York: The Macmillan Company,
1929, 15.

The basic method that was used was that of Henmon in his French
Word Book Based on a Count of 400,000 Running Words (see above)
where categories of words were selected and teachers of French
were used to help compile the ultimate product. The Committee
combined their list with that of Henmon. The result was a count
of over one million running words from eighty-eight examples of
French prose. The study, however, did not include anything
other than printed material and used no particular criteria for
selection of sources. The texts used to draw up this count were
selected from nineteenth and twentieth century literature, and
were divided into twelve categories. The method of tabulation
and analysis is also described. Part I lists the words omitted
from the count but listed in the Henmon study. Part II lists
the words by range and gives both the range and frequency. Part
III combines the list with the Henmon list and gives the range,
frequency, the Henmon frequency and the total frequency of the
word. The order is alphabetic. An appendix lists those words
which Henmon listed, but which the committee found to be too
low in frequency to be counted.

275. Van Spaandonck, M. Practical and systematic Swahili bibliography:
linguistics 1850-1963. Leiden, Netherlands: E. J. Brill, 1965.

Chapter headings indicate the coverage by classification: general,
linguistics (including grammars, instruction, and phrase books,
supplemental linguistic studies, dictionaries and vocabularies,
supplemental vocabulary studies), literature, Katanga Swahili,
and an appendix.

276. Voelker, Charles H. The one-thousand most frequent spoken words.
Quarterly Journal of Speech, 1942, 20, 189-198.

Voelker's list is based on a sample of 99,400 words gathered from
the speech of older adolescents.

277. Walsh, S. (ed.) English language dictionaries in print: a comparative
analysis. Newark, Delaware: Reference Books Research Publications, Inc.,
1965.

This compilation compares dictionaries by cost, number of entries,
age-group suitability, date of publication, and user ratings. It
also describes each one briefly.

278. Wepman, J., and Hass, W. A spoken word count (children 5-7). Chicago:
Language Research Associates, 1969. (See also A spoken word count by
Wepman and Jones)

This study attempts to supply additional information on the
quantitative aspects of children's word usage. It also discusses uses
of children's word counts aside from inclusion in readers. The
sample group was 90 middle-class English-speaking metropolitan
children aged 5-7 (30 at each age), divided equally between boys
and girls (45 each). The procedure was similar to that used by
Wepman and Jones for adults. The material was computer processed
into three lists: word frequency in order of frequency, part of
speech by grammatical class, and an alphabetical list of all words
used by at least two speakers. It also contains a short bibliography.



279. Werner, H., and Kaplan, B. An organismic-developmental approach to
language and the expression of thought. Symbol formation. New York:
John Wiley and Sons, Inc., 1963.

The authors use their perspective of psychological phenomena to
demonstrate how that perspective enables an individual to order
and integrate data on symbolization and language behavior. The
book contains five parts: organismic-developmental approach,
formation and general changes in verbal symbolic behavior in the
course of ontogenesis, processes which underlie the primordial
states of linguistic representation through study of adult behavior,
linguistic representation under differing conditions of
communication, and symbol formation in non-verbal media.






280. West, M. On learning to speak a foreign language. London: Longmans,
Green, and Company, Ltd., 1933.

This is a book on teaching spoken English as a foreign language.
It discusses purpose (aim), policy (theory), techniques (methods),
vocabulary (reading and speaking vocabularies), and minimum adequate
vocabulary to include the concept of completeness and vocabulary
design. West considers the Thorndike and Horn 1000 most common
word lists, the American College list (1000 words, and similar to
Thorndike's), the 1000 words used by the Adult Education Society
of New York, Palmer's list of 600 words, Palmer's Composite Word
Frequency List of 1000 words, and Ogden's Basic English Vocabulary.
He finally arrives at a list of 996 words and procedures for
structuring lessons from those words and his text.

281. West, M. Definition vocabulary. (Bulletin No. 4, Department of
Educational Research, Ontario College of Education, University of Toronto)
Toronto, Canada: University of Toronto Press, 1935.

This is a study on how to determine the vocabulary to be included
in a dictionary for foreigners. The author argues that the major
problem in preparing a dictionary for foreigners learning English
is selecting the words to define the words in the dictionary, in
such a way that both will be understood. West solves the problem
by determining some 1,490 words and 85 irregular verb forms and
plurals with which he can explain all the words and idioms in the
proposed dictionary of \7iP7 words and 6,171 idioms. (This, of
course, becomes difficult when the idioms are not self-explanatory
on the basis of individual word meaning.) The book has three
chapters and 28 explanatory tables.

282. West, M. A general service list of English words . London: Longmans, 

Green and Company, Ltd., 1960.

This book contains a frequency count based on 5 million words and 
has a semantic count by percentages for written and printed English. 

283. West, M. Teaching English in difficult circumstances (teaching English
as a foreign language). London: Longmans, Green and Company, Ltd.,
1960.

This book is based on experience in teaching English in India and
the Middle East. In general, the book is a teachers' guide but it
has an appendix with a minimum adequate (1200 word) vocabulary for
spoken English and a classification guide to accompany it.

284. West, M. An international readers' dictionary. London: Longmans,
Green and Company, 1965.

This dictionary contains 24,000 items (18,000 words and 6,000
idioms). It is designed for the use of persons for whom English
is other than their native language. It supersedes and updates
the New Method Dictionary by West and Endicott (1935-1960).
Explanations are made within a vocabulary of 1490 words held to be
among the most common in the English language as learned by
foreigners. It excludes scientific and technical terms in common
use in news media and books. It excludes also certain derivations
and compounds when their meanings can be inferred from the root
word and context.

285. West, M., and Bond, O. A grouped-frequency French word list. Chicago:
University of Chicago, 1939.

The purpose of this book was to re-work the Vander Beke French
Word Book into forms more useful to teachers. This book by West
and Bond has three parts: a frequency list in numerical order with
inflectional forms under head-words and listed in 100-word groups;
an index, which is an alphabetical list of head-words; and two
appendices: fifty Latin roots and common French affixes (prefixes
and suffixes).

286. Whatmough, J. Language— a modern synthesis . New York: Mentor Books, 
1956. 

In many respects this book looks like a companion piece to Zipf's
A Psycho-Biology of Language. It has 13 chapters including past
and present languages, words and meanings, the uses, structure,
analysis, and neural basis of language, as well as the mathematics
and statistics of language.

287. Whiteley, W. Some problems of transitivity in Swahili. London: Luzac
and Company, Ltd. (for the University of London), 1968.

Two areas in which existing Swahili dictionaries are weak are
transitivity and verbal extensions. This book deals with transitivity
(of verbs). Transitivity deals with the various relationships which
obtain between a verb and a noun or nouns to which the label
"object" is often accorded. This book deals with the subject in
detail, as an aid to writing more precise meaning into dictionaries.

288. Whiteley, W. Swahili: rise of a national language. London, England:
Methuen and Company, Ltd., 1969.

This book is a broad survey of the Swahili language and literature,
its early history, spread, status in the colonial period, its
current status, and its prospects. One chapter is devoted to
"standard Swahili" and a bibliography is included.

289. Whiteley, W., and Gutkind, A. A linguistic bibliography of East
Africa. Kampala: East African Swahili Committee and East African
Institute of Social Research, Makerere College, 1958.

This is a classified bibliography indexed in part by country
(Tanzania, Kenya, and Uganda), and in part by language (Swahili).
It is very useful in finding local names for flora and fauna, as
well as for more general works on the languages and linguistics
of East Africa.

290. Wilson, P. English-Swahili (classified vocabulary). Nairobi, Kenya:
East African Literature Bureau (undated).

The vocabulary is classified by vocation as: agriculture,
engineering, fishing, household, medical, and veterinary.

291. Wilson, P. Simplified Swahili. Nairobi, Kenya: East African
Literature Bureau, 1970.

This book is written for the individual who wants to achieve a
quick general knowledge of Swahili. Grammar is kept moderately
simple and is introduced as required and in order of relative
importance of subject matter. It is an up-to-date book which
will help individuals learn spoken Swahili. It includes
translation exercises and keys to them. At the end of the book there
are Swahili-English and English-Swahili vocabularies.

292. Winter, Ralph Dana. English function words and content words: a
quantitative investigation. Dissertation Abstracts, 195', Hf^, 1084-1085.

The author chose two texts of 4000 words each and parsed them
according to the system established by C. C. Fries. Among the
statistical measures employed were: word length in segmental
phonemes, gaps between repetitions of the same word, "average
interval between successive occurrences of a word", and "sum of
the squares of the intervals between successive occurrences of a
word". The study shows that quantitative data generally support
the division of function and content words.
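
Two of the measures named in this annotation, the average interval between successive occurrences of a word and the sum of the squares of those intervals, can be illustrated with a short sketch. The annotation does not give the dissertation's exact definitions, so the code below assumes an interval is the difference between the token positions of consecutive occurrences; the function name and the sample sentence are hypothetical.

```python
def interval_measures(tokens, word):
    """Return (average interval, sum of squared intervals) for one word.

    Assumption: an interval is the distance, in running words, between
    two successive occurrences of the word. Returns None when the word
    occurs fewer than twice, since no interval exists.
    """
    positions = [i for i, token in enumerate(tokens) if token == word]
    intervals = [later - earlier for earlier, later in zip(positions, positions[1:])]
    if not intervals:
        return None
    average_interval = sum(intervals) / len(intervals)
    sum_of_squares = sum(d * d for d in intervals)
    return average_interval, sum_of_squares


# Hypothetical example text (not from Winter's corpus).
text = "the cat saw the dog and the bird".split()
print(interval_measures(text, "the"))  # (3.0, 18): intervals are 3 and 3
```

A frequent function word such as "the" tends to recur at short, regular intervals, which is the kind of contrast with content words that such measures are meant to expose.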

293. Wisbey, R. A. (ed.) The computer in literary and linguistic research
(papers from a Cambridge symposium). Cambridge, England: Cambridge
University Press, 1971.

This is a compilation of articles on the subject in seven parts,
each of which has from 3 to 6 articles. Parts are titled as
follows: Lexicography, Textual Archives, and Concordance Making;
Textual Editing and Attribution Studies; Vocabulary Studies and
Language Learning (the most pertinent item is D. G. Burnett-Hall
and P. Strupple's "The Use of Word Frequency in Language Course
Writing"); Stylistic Analysis and Poetry Generation (the most
important item is T. R. Tallentire's "Mathematical Modeling in
Stylistics: Its Extent and General Limitations"); Computer
Applications to Oriental Studies; Problems of Input and Output; and
Programming the Computer for Literary and Linguistic Research.

294. Wright, C. W. An English word count. (Department of Education; Arts
and Sciences Research Series No. 15, National Bureau of Educational
and Social Research, Pretoria, South Africa, 1965) London: Longmans,
Green and Company, Ltd., 1965.

This count is based on written English in South Africa. The
counts were taken from the Bible, newspapers, periodicals,
literary works, and correspondence. It has three lists
covering 20,000 words: (1) first 1000 words in alphabetical
order with an indication of the grouping of 100 words in which
the word falls, (2) first 10,000 words, and (3) second 10,000
words.

295. Yamagiwa, J. (ed.) Japanese language studies in the Showa period.
(University of Michigan Center for Japanese Studies, Bibliographic
Series No. 9) Ann Arbor, Mich.: University of Michigan Press, 1961.

This is a bibliography of modern Japanese work on the Japanese
language since 1926 (the Showa Period). Chapter headings are:
Bibliographies, Essay Series and Journal, Dictionaries,
Encyclopedias, and Indices of Vocabulary, Outlines and Description of
Japanese Language Studies, History of Japanese Language Studies,
Phonology, Grammar, Relationships of Japanese to other Languages
of East Asia, History of the Japanese Language, Dialect Studies,
Writing Systems, List of Publications, and Authors and Editors.

296. Yamagiwa, Joseph K. Linguistic data: some quantifications. Studies
in languages and linguistics in honor of Charles C. Fries. Albert H.
Marckwardt (ed.), Ann Arbor: 1964, 35-53.

This study provides a statistical examination of the stylistic
varieties in contemporary Japanese.

297. Young, I., and Nakajima, K. Learn Japanese: college text. (Asian
Language Series) Honolulu: East-West Center Press, 1967, X'it*

This series was originally written as Learn Japanese-Pattern
Approach. The dialect used is of native speakers of a middle
class background, college education, residents of the Yamanote
area of Tokyo and 2§-45 years of age. The pattern approach is
more than formula-application. It develops a new presentation
based on association and repetition. It reflects connecting
links between modes of utterances or patterns. A pattern is a
structure related to other structures. Moving from one structure
to another is done by transformation. The patterns reflect
"live" situations as well as the structure of the language.
The material is based on a contrastive study of English and
Japanese structure. Volume 1 has 15 lessons. The general
format of lessons includes: useful expressions, pattern sentences,
dialogues, notes, vocabulary, Hiragana practice, and drills.
In many respects, this approach parallels that of Jorden and
Chaplin's Beginning Japanese, but teaches hearing, reading and
writing, as well. Volume 2 is a continuation of Volume 1 but
is more advanced. Volume 3 is an introduction to Kanji
characters. It has 15 lessons and seven appendices, including a
glossary. Volume 4 introduces more Kanji characters and has
15 lessons and seven appendices, including a glossary.

298. Zale, E. M. (ed.) (Proceedings of the) Conference on Language and
Language Behavior. New York: Appleton-Century-Crofts, 1968.

The Conference on which this volume reports was held under the
sponsorship of the Center for Research on Language and Language
Behavior at the University of Michigan in 1966. Major topics
discussed were: first language acquisition in natural setting,
controlled acquisition of first language skills, second language
learning, linguistic structure above sentence level, phonology
and phonetics, and language impairment. Most of the six subject
areas were covered by four speakers each. The three major
addresses, attended by all conferees, were as follows: (1)
"Scylla and Charybdis, or the Perilous Straits of Applied
Research" by A. P. Van Teslaar, (2) "Thought and Language" by
James J. Jenkins, Director of Research, Center for Research in
Human Learning, University of Minnesota (not included in the
report, but published otherwise by the University of Pittsburgh), and
(3) "Word Frequency Studies and the Lognormal Distribution" by
John B. Carroll. Carroll's address is an extended edition of
the one actually presented to the Conference. The main theme
has also been published separately in several forms. Also of
interest are the remarks on high and low association passages
in the talk by Sheldon Rosenberg on "Language Habits and
Recall of Connected Discourse", the discussion of negation in
Japanese in "What Does a Child Mean When He Says No?" by David
McNeill and Nobuko B. McNeill, "The Indices of Coverage: A New
Dimension in Lexicometrics" by W. F. Mackey and J. G. Savard,
"Auditory Discrimination and the Learning of Languages" (in French)
by Emmanuel Companys, and "Remarks on the Predictive Value of
Differential Analysis in Phonology" (in French) by Guy C. Capelle.

299. Zimmermann, Jon E. Word frequency in the modern German shorter
narrative. Dissertation Abstracts, 1968, 28, 3362A.

The word count was based on 702 samples of narrative prose by 266 
different authors; 160,000 words were randomly selected from the 
two million running words In the whole corpus. The count reflects 
the distribution of words in the 160,000 word sample. 

300. Zipf, G. K. Observations on the possible effect of mental age upon
the frequency distribution of words from the viewpoint of dynamic
philology. Journal of Psychology, 1937, 4, 239-244.

This is a response to such critics of Zipf's rank-frequency
hypothesis as Empson, Joos, and Thorndike. The author points out
that the rank-frequency relation holds even when total vocabulary
size is different, as in the language of two children of different
ages that is discussed in the essay.

301. Zipf, G. K. Homogeneity and heterogeneity in language. In answer to
Edward L. Thorndike. The Psychological Record. 1938, 2, 347-367.

Here Zipf replies to Thorndike's criticism of his rank-frequency
hypothesis; he then tests his theory by a study of word
distribution in several works by James Joyce.

302. Zipf, G. K. The meaning-frequency relationship of words. The Journal
of General Psychology. 1945, 33, 251-256.

This study investigates the relationship between the frequency of
occurrence of a word and its number of meanings.

303. Zipf, G. K. Human behavior and the principle of least effort. Cambridge,
Mass.: Addison-Wesley Press, Inc., 1949.

In this book the author demonstrates his theory of least effort
and human behavior in two contexts: language and the structure
of personality and human relations; a case of intraspecies
balance. Part 1 is of interest to linguists since it deals
with the use of language and symbol formation.

304. Zipf, G. K. The psycho-biology of language (an introduction to dynamic
philology). Cambridge, Mass.: MIT Press, 1965 (Originally
Houghton-Mifflin Company, 1935).

In this book, Zipf explores speech as a natural phenomenon, a
biological-physiological and social process, by the use of
statistical approaches. He finds that the distribution of words in
English approximates an harmonic series. He includes meaning and
emotion in his studies of language forms and functions. His book
is divided in six chapters: Introduction, Form and Behavior of
Words, Form and Behavior of Phonemes, Accent Within the Word,
The Sentence (Positional and Inflectional Languages), and The
Stream of Speech and its Relationship to the Totality of Behavior.
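
The statement that word frequencies approximate an harmonic series can be read as follows: if the most frequent word occurs f times, the word of rank r occurs roughly f/r times. The sketch below compares an observed rank-frequency list with that prediction. It is an illustration only; the function names and the artificial token list are assumptions, and the toy data are constructed to fit the series exactly, which real text does not.

```python
from collections import Counter


def rank_frequency(tokens):
    """Observed frequencies, ordered from the most to the least frequent word type."""
    return [freq for _, freq in Counter(tokens).most_common()]


def harmonic_prediction(top_frequency, n_ranks):
    """Frequencies predicted by a harmonic series: rank r gets top_frequency / r."""
    return [top_frequency / rank for rank in range(1, n_ranks + 1)]


# Artificial sample built to follow the series (hypothetical, not Zipf's data).
tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
observed = rank_frequency(tokens)
predicted = harmonic_prediction(observed[0], len(observed))

for rank, (obs, pred) in enumerate(zip(observed, predicted), start=1):
    print(rank, obs, pred)
```

With a real corpus, the interesting quantity is how far the observed column departs from the predicted one at each rank.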



Aborn* Murray 


1. 2 


Bourne, Charles P* 


3*» 


Allen, H, Jr. 


lUU 


?oven, John H« 


35 


Allen, J. 


3 


Brader, Meurcia 


If 


Allen, V, 


U 


BralRf JaLa 


36, 37 


American Mathematical Society 3 




39 


Ardouin. P. 


182 


Buchanan 9 A a 


ti^ 
UO 


Ashfi n . P « 




Buchanan, Ma 


i.<i 

4l 


Ashtoiia Ea 


7 

1 


Buettner, Ca 


US 


Auscbjj^iiin&n. Max* Ian 




Bull, William Ea 


1. 1.1. 


Bailev. D« 


A 


Burton, Doloren Ma 


9 


Ballev. Richard W. 

^ a*«fcWiaoA •» # 




Burton, NaGa 




Bakava. R«M« 


10 


Bushnell, Pr*ul Fa 




BakeTa Sidney. Ja 


11. 12 


Card, William 




Bar Millal Y 


13 


Carrol, Jean 


16 


Bar"hfn» n^T* 

m*4 W4 ^ s#aMa 


111 


Carroll, John B* 


U8, U9, 50, 




i5 




52, 53» 5**, 


Barth. Gilhert 


xD 




56, i)7, 58 




17 


L arte r , unar jl6 s w a 


101 




17 f lo 


Chaplin, Ha 


59 


JWS JT ^ £i a V a 


19 


ChomsiQrt M« 


Cf\ 

cO, ol, q2. 






ChOtlOS, JaWa 




Belonogov, G,G, 


21 


Cnreti'^n , DaC • 


05 


Berckei, J, A, 


22 


Cole, La 


oc 


Burger, K, 


23, 2U, 25 


Condon, EaUa 




Berkovitz, A, 


im 


Corstius, H-* Brandt 


22 


Berry, Jack 


26, 27 


Daiji, s. 


£.tk 

00 


BlacV, John W. 


28 


Dale, £• 


69, 70 


Flankenship, Jane A.. 


29 


DavieB, A, (ed.) 


71 


Bloch, B, 


30 


Davie 8, P, 


57 


Bond, 0. 


285 


Denes, P.B. 


72 


Bongere, K, 


31 


DeVito, Joseph A, 


73, /U, 1 


Booth, Andrew D« 


32 


Dewey G. 


77 


Borkc , Harold 


33 








,70 r, 







TO 


Gen. Xtasaka 


136 


^AA9UU| Alt V™A« , 




79 


George, Alexander L. 


UU 






OU 


George, K.U. 


115 






Ol 


Gerganov, Y.N. 


107 






o2 


Gibson, Toman 


116, 12U 






Ah 


Gilmore,» T. 


117 






Ac 


Good, I.J, 


U8 






246 


Gougenheim, G. 


119 


AOIaunaSOn t n • r • 




o6 


Graham, E, 


120 






87 


Green, J.R, 


121 






OO 

00 


Greenvay, P.J. 


122 


iScXiegorcif Alvar 


89, 90 


• 91. 92 


Gross, M« 


123 


i!i8 uOUp f U • iS« 




93 


Gruner, Charles R. 


116, 12U 






nil 


Guilbert, Louis 


12U 






95 


Guiraud, Pierre 


126, 127 


Flood, W, 




96 


Gutkind^ A, 


289 


Ford, Donald F, 






Heurvood.t F,W« 


128 


Fovler, Murray 




9T 


A ^ ^ » t 
HasSi W« 


278 


Francis, W, 




loo 


Haydon, Rebecca E. 


129 


Franjslln, H. 




oA 

90 


Hays, D.C-, 


130 


FMch Ministry of National 




Henaon, V.A.C. 


131 


Education 




99, 100 


Kerdan, G, 


132, 133, 


French, Noraan R. 




101 




131*, 135 


Friedaan, £.A, 




195 


Herold, CD, 


219 


Fries, A.C, 




103 


Hibbert, H, 


136 


Fries, C. 


102, 


103 > lOU 


Hill, Archibald 


137 


Fries, C,C, 




172, 173 


Hill, L. 


133 


Fruttijl-a, F,M, 


105, 106, 107 


Hols te in, A«.P. 


139 


Fry, Dennis 






Horn, E. 


IHO 


Fucks, WUhelffi 




109, 110 


Hornby, A.S, 


217 


Gaomion, Edward R, 




111 


KorowitB, W, 


Ikl 


Sarcia Hoz, V, 




112 


Horovits, M,W, 




Garvin, Paul (ed.) 




113 


Horton, Iver 


250 






Hoves, !)• 
Hx&ltzen, I. 
Ichiro, S. 
Jakobovitz, L.A. 
Joanson, D.B. 
Jobi>8on, F. 
Jones, L.V. 
Jones, R.M. 
Joos, Mai*tin 
Jorden, £. 
Josiielyn, H. 
Kaeding, F.W. 
Kaplan, B. 
Karlgren, Hans 
Keil, Holf-Dietrich 
Kelly, Francis, J. 
Klbler, Ro>x>rt J. 
Klhouka, T* 
Kochi, D. 
Koenig, Walter 
Koutsoudaa, Andreas M. 
Kramslqr, J. 
Kraus, Jirf 
Krishnomurthye K.H« 
Kroeber, Karl 
Krohn, R. 
Kublin, H. 
Kucera, H, 
Kvasa, S. 
LachBan, R* 
Lado, R. 

Losab, Sydney .M. 



1U3 




Lamendella, John T. 


58 


lUU 




LeBreton, P. 


175 


1U5 




Licklider, J.C.R. 


U5 






Light, Richard L. 


176 


IU7 




Loogman, A. 


177, 178 


1U8 




Lorge, I. 


179, 180, 


IU9 






267 


150 




MacMurriiy, E. 


80 


151, 


152 


MacPhee, E. 


^♦0 


30, 


153 


Machol, Robert E^ 


161 


15 u 




Mackey, W.F, 


181, 182 


155 




Malcolm, J. 


183 


279 




Mandelbrot, Benoit 


18U 


156 




M&rchand, H. 


185 


157 




Marchand, M. 


186 


116 




Martin, S.E. 


59, 187, 


I2U 






188, 189 


158 




Mav, J. 


190, 191 


159, 


160 


Mayaji, Hiroshi 


192 


101 




McCalla, Gordon I 


193 


161 




McCaruB, Ernest 


19k 


162 




McDavid, Virginia 




169 




McGovern, W. 


195 


I6U 




Moddleton, Ivor G. 


2U5 


165 




Meir, Helmut 


196 


166 




Meiklc, H. 


98 


167 




Michea, R. 


119 


168, 169 


Mrilic, Louis T. 


197 


117 




Millar, D.E. 


19 


i7J 




Miller, G. 


62, 63c 198 


171, 


172, 




199 


173 




Miron, M. 


lUU 


Ylk 




M^kken, R.J. 


22 






Monroe, G.  169
Moore, W.  200
Morgan, B.A.  201
Morris, K.  251
Muller, Charles  202, 203, 204
Nakajima, K.  295
National Institute of Health  205
(The) National Language Research Institute (of Japan)  206, 207, 208, 209, 210
Newman, Edwin B.  199, 211
Newman, J.B.  142
Nice, Margaret Morse  212
Nihonmatsu, R.
Nisbet, J.D.  213
Oettinger, Anthony G.  214
Ogawa, Y.  200
Ogden, C.  215
Palmer, H.  216
Palmer, H.E.  217
Perrott, D.V.  218
Petty, W.T.  219
Pfeffer, A.  220, 221, 222, 223
Pimsleur, Paul  224
Plath, Warren  225
Polome, E.C.  226, 227
Posner, Rebecca  228
Pressman, A.  229
Puria, L.  230
Rammuny, Raji  194
Rapoport, Anatol  231
Razik, T.  69
Reed, David W.
Reichert, D.  70
Resnikoff, H.L.  80
Richards, Jack C.  233
Richman, B.  57
Rivenc, D.  119
Roberts, A. Hood  234
Robinson, V.P.  235
Rodriguez-Bou, I.  236
Rose-Innes, A.  237
Rosman, E.  240
Rubenstein, Herbert  1, 2
Russo, G.A.  238
Rutherford, R.W.  239
Sacleux, C.  241
Sampson, Jeffrey R.  193
Sauvage, A.  119
Savard, J.G.  181, 182, 242, 243
Scholes, Robert J.  244
Schonell, Fred J.  245
Seashore, Robert H.  246
Sebeok, Thomas A. (ed.)  247
Shannon, C.E.  248
Shapiro, B.J.  249
Shaw, B.A.  245
Sherwood, John  250
Sillakus, H.  251
Simon, Herbert A.  252
Skinner, B.F.  253






Society For International 




West, M. 96, 


259, 280, 281, 


Cultural Relations (Japan) 


254,


282 


, 283, 285 




255 


Whatmough, J.


286 


Somers, H.H. 


256 


Whiteley, W.


287, 288, 289 


Spolsky, Bernard 


257 


Wijngaarden, A. Van


22 


Starkweather, J.A.


18 


Williams, F.


262 


Stoll, E. 


219 


Wilson, P.


290, 291 


Stone, P., et al 


258 


Winter, Ralph Dana 


292 


Strain, J. 


98 


Wisbey, R.A. (ed.)


293 


Svenson, E. 


259 


Wright, A.M. 


128 


Svenson, Rodney 


260 


Wright, C.W. 


294


Tadashi, Kikuoka


261 




107 


Tannenbaum, Percy H. 


262 


Yamagiwa, J. (ed.)


295, 296 


Tarnoczi, Lorant


263 


Young, I. 


297 


Taylor, G.


264


Zale, E.M. (ed.) 


298 


Thompson, Godfrey H. 


265 


Zimmerman, Jon E. 


299 


Thompson, J.


2Ck 


Zipf, G.K.


300, 301, 302, 


Thorndike, E. 179, 266, 267




303, 304


Traver, A. 


104






Uhlirova, L.


268 






University of Michigan 









(English Language Institute) 269, 270 



Vakar, Nicholas P. 


271, 272 


Van, Th. M. 


22 


Van den Eynde, R.


273 


Van Spaandonck, M. 


274


Vasilevich, A.P.


275 


Voelker, Charles H. 


276 


Walsh, S. (ed.) 


277 


Waugh, Nancy C. 


211 


Wears, M. 


239 


Wepman, J.H. 


149, 278


Werner, K. 


279 









Unclassified 



Security Classification



DOCUMENT CONTROL DATA R&D 



Unclassified 



Syracuse University Research Corporation 
Merrill Lane, University Heights 
Syracuse, New York 13210 

REPORT TITLE

The Counting of Words: A Review of the History, Techniques, and Theory of Word
Counts with Annotated Bibliography



DESCRIPTIVE NOTES (Type of report and inclusive dates)

Special Report, 1 July 1972 - 15 May 1973

AUTHOR(S) (First name, middle initial, last name)



James E. DeRocher, Murray S. Miron, Sam M. Patten, Charles C. Pratt



May 1973 


TOTAL NO. OF PAGES


7b. NO. OF REFS

398 


DAAG-O5-T2-C-05Ti* 

PROJECT NO.


9a. ORIGINATOR'S REPORT NUMBER(S)

SURC TR 73-177


9b. OTHER REPORT NO(S) (Any other numbers that may be assigned
this report)

NONE 


Approved for public release. Distribution unlimited.


NONE 

ABSTRACT


Defense Language Institute, 



As part of a continuing project of language analysis, SURC presents an essay
on the nature and history of frequency counts. The first section deals with the
history of such counts and traces them from Early Hellenic times to the present.
Section Two is an analysis of techniques used and describes the capabilities and
limitations of frequency counts taken in both the English and Foreign Languages.
Section Three is an analysis of the statistical lawfulness of vocabulary
distributions and presents a comparison and evaluation of the theoretical models
used to describe vocabulary distributions. Section Four is an annotated
bibliography with an author index provided.






DD FORM 1473




Unclassified



Security Classification



LINK B 



LINK C



ROLE 



WT



WT 



ROLE



Structural Analysis 
Languages 

Mathematical Linguistics 
Vocabulary 
Language Research
Descriptive Linguistics 
Contrastive Linguistics 
Etymology 



DD FORM 1473 (BACK)

(PAGE 2)







